Reliable broadcast protocols are important tools in distributed and fault-tolerant
programming. They are useful for sharing information and for maintaining replicated
data in a distributed system. However, a wide range of such protocols has
been proposed. These protocols differ in their fault tolerance and delivery ordering
characteristics. There is a tradeoff between the cost of a broadcast protocol and
how much ordering it provides. It is, therefore, desirable to employ protocols that
support only a low degree of ordering whenever possible. This dissertation presents
techniques for deciding how strongly ordered a protocol must be to solve a given
application problem.

We show that there are two distinct classes of application problems: problems
that can be solved with efficient, asynchronous protocols, and problems that require
global ordering. We introduce the concept of a linearization function that maps
partially ordered sets of events to totally ordered histories. We show how to construct
an asynchronous implementation that solves a given problem if a linearization
function for it can be found.

We prove that in general the question of whether a problem has an asynchronous
solution is undecidable. Hence there exists no general algorithm that would
automatically construct a suitable linearization function for a given problem. Therefore,
we consider an important subclass of problems that have certain commutativity
properties. We present techniques for constructing asynchronous implementations
for this class. These techniques are useful for constructing efficient asynchronous
implementations for a broad range of practical problems.
Chapter 1
Introduction 


Broadcast protocols are useful tools for distributed and fault-tolerant programming.
However, a wide range of such protocols has been proposed, differing in their fault
tolerance and delivery ordering characteristics. This thesis describes techniques for
choosing the type of broadcast that will maximize the performance of an application
without compromising its correctness.

This work was motivated by the ISIS system, a toolkit for building fault-tolerant
distributed applications. All tools provided by ISIS are based on a set of broadcast
communication primitives. These primitives as well as the tools built from them
are made available to the application programmer. One objective of this thesis is
to gain an understanding of the theoretical foundations of the ISIS system, and
thereby help the programmer in selecting and using the tools provided by ISIS. The
interested reader is referred to [BJ87a,BJS88] for a description of ISIS.
Figure 1.1: Distributed system


1.1 Distributed Systems 

A distributed system consists of a set of independent processors, $p_1, \ldots, p_n$,
connected by a communication network (see Figure 1.1). In such a system, processors
exchange information only by sending messages. There are several parameters that
determine the characteristics of the system:

• Network topology: Some pairs of processors can communicate directly. 
Messages between other pairs of processors have to be routed through one 
or more intermediate processors. A network in which every pair of processors
can communicate directly is called completely connected.

• Message ordering: Some communication protocols do not make any guarantees
about the order in which messages are delivered. Other protocols provide
FIFO ordering, i.e., messages are delivered in the order they were sent.

• Message reliability: Message channels may be reliable (all messages sent 
are delivered correctly), subject to omission failures (messages may be lost), 
or subject to Byzantine failures (messages may be lost or corrupted). Fur- 
thermore, there may or may not be an upper bound on the time between the 
sending of a message and its delivery. 

• Processor reliability: A processor may fail in several ways. It may stop without
taking incorrect actions (fail-stop) [SS83], fail to send or receive some messages
(omission fault) [Had84], or behave arbitrarily (Byzantine fault) [LSP82,
SD83].

In this dissertation we assume a completely connected network with reliable message 
delivery. This decision is justifiable on practical grounds: data-link protocols and 
network routing protocols satisfying these assumptions are well understood [Tan81]. 
In general, we do not assume an upper bound on message delays, nor do we assume 
that processors have synchronized clocks. A system with these characteristics (un- 
bounded message delays, no synchronized clocks) is called asynchronous. 

Processors may experience failures, but we restrict ourselves to non-Byzantine
failure modes. Processor omission faults can be treated in the same way as the
loss of messages in the communication network. Therefore, we consider only crash
failures (fail-stop processors).
1.2 Objectives

Many applications running in a distributed system require processors to share in- 
formation. Often it is also desirable to replicate information at different sites to 
avoid data loss should a failure occur. Useful tools for sharing information and for 
maintaining replicated data are reliable broadcast protocols. Such protocols prop- 
agate information from one processor to a set of destination processors in such a 
way that all operational destinations receive this information despite failures in the 
system. This property is called reliable message delivery. In addition to this, a 
broadcast protocol may also provide a form of message ordering. The strongest 
form is atomic ordering. An atomic broadcast protocol guarantees that all messages 
are received in the same order everywhere. An example of a weaker form of ordering 
is FIFO. A FIFO broadcast guarantees that two messages sent by the same processor 
are received everywhere in the order they were sent. Messages sent by different 
processors, however, may be received in different orders at different sites. 

There is a tradeoff between how much ordering a protocol provides and how much 
synchronization delay is necessary to implement this ordering. A FIFO broadcast, 
for example, can be implemented efficiently on top of unordered message channels 
by adding a sequence number to every message. An atomic broadcast, on the other 
hand, is much more costly to implement in the systems we study. It requires two 
or more phases of message exchanges between processors before a message can 
be delivered. It is, therefore, desirable to employ protocols that support only a
low degree of ordering whenever possible. This dissertation presents techniques
for deciding how strongly ordered a protocol has to be in order to solve a given
application problem.
1.3 Outline

This dissertation consists of seven chapters. Chapter 2 describes several different 
forms of broadcast protocols known in the literature and discusses their benefits 
and costs. 

Chapter 3 introduces a formalism for specifying an application problem in a 
distributed system, and presents a model for broadcast-based implementations that 
solve such problems. 

Chapter 4 investigates conditions under which a specification has an asyn- 
chronous implementation. It is shown that if such an implementation exists, it 
can be expressed in a canonical form. 

Chapter 5 proves that in general the existence of an asynchronous implementation
for a given problem is undecidable. However, we identify a subclass
of specifications that captures a broad range of practical problems. The defining 
characteristic of specifications in this class is that they have certain commutativ- 
ity properties. We describe methods for finding asynchronous implementations for 
specifications in this class. 

Chapter 6 examines how processor failures can be integrated into our model and 
shows how this affects the results of Chapter 4 and Chapter 5. 

Chapter 7 summarizes our results and discusses future extensions of our work. 

Throughout the thesis we use an example that is first introduced in Chapter 3. 
Appendix A contains a comprehensive presentation of this example in which we 
collect the different elements addressed throughout the thesis into one discussion. 


Chapter 2 

Reliable Broadcast Protocols 


Because this dissertation is about selecting among different forms of broadcast
protocols, we devote this chapter to giving an overview of several variants of broadcast
protocols.¹ Such protocols have two distinct properties:

• Reliability: The protocol ensures that a message that is broadcast will eventually
reach all its destinations, even if failures occur while the protocol is
running.

• Ordering: Some protocols make guarantees about the order in which different
broadcast messages are received at different destination sites.

We will describe how these features can be implemented on top of a network that
provides only point-to-point communication between processors. In our discussion
we will first concentrate on the ordering aspect. The following section will present
different ordering properties and describe how such properties can be implemented
in a completely reliable system in which processors do not fail. In Section 2.2 we
will examine how the different types of protocols can be made fault-tolerant.

¹ The term “broadcast” is often used to mean that a message is sent to all processors
in the system. We will use it in the more general sense of sending a message
to some subset of all processors. This is often called a multicast.

2.1 Broadcast Ordering 

2.1.1 Unordered Broadcast 

The simplest way of broadcasting a message is to send a copy of that
message to every destination processor individually. This form of broadcast does
not provide any form of ordering. Figure 2.1 illustrates this. It shows a system with
four processors. Time proceeds from left to right, and the diagonal lines represent
messages. The figure shows $p_1$ broadcasting two messages (a and b) to $p_2$, $p_3$, and
$p_4$. The two messages arrive in the same order at $p_2$ and $p_3$, but $p_4$ receives them
in a different order. Because this form of broadcast does not guarantee any specific
order of delivery, we call it an unordered broadcast, or simply BCAST.

2.1.2 FIFO Broadcast

If the underlying communication network provides FIFO message channels, then 
the protocol just described will satisfy a stronger ordering property: All messages 
broadcast by the same processor will be delivered in the same order everywhere, 
namely the order they are sent. Even if the network does not provide FIFO message
channels, it is not difficult to implement FIFO ordering by adding a sequence
number to every message [Tan81]. We call this a FIFO broadcast, or FBCAST for
short. In Figure 2.2, for example, processor $p_1$ broadcasts two messages, first a
then b. Processors pz and pi receive these messages in the order they were sent. 
Broadcasts Bi and B%, however, are sent by different processors; such broadcasts 
may be delivered in different orders at different sites, as shown in the example. 
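
The sequence-number technique mentioned above can be illustrated with a minimal Python sketch of the receiving side. This is not the implementation from [Tan81]; the class name `FifoReceiver`, the zero-based per-sender counters, and the processor labels are assumptions made only for illustration.

```python
from collections import defaultdict


class FifoReceiver:
    """Hypothetical receiver that delivers each sender's messages in sequence order."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next expected sequence number
        self.pending = defaultdict(dict)   # sender -> {seq: payload} held back
        self.delivered = []                # payloads in delivery order

    def receive(self, sender, seq, payload):
        """Accept a message from an unordered channel and deliver what is now in order."""
        self.pending[sender][seq] = payload
        while self.next_seq[sender] in self.pending[sender]:
            self.delivered.append(self.pending[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1


receiver = FifoReceiver()
receiver.receive("p1", 1, "b")   # arrives early and is held back
receiver.receive("p2", 0, "c")   # different sender, delivered immediately
receiver.receive("p1", 0, "a")   # fills the gap, so "a" and then "b" are delivered
print(receiver.delivered)        # ['c', 'a', 'b']
```

Messages from the same sender come out in the order sent, while the relative position of messages from different senders depends only on arrival order, which is exactly the freedom FBCAST leaves open.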
Figure 2.1: Unordered broadcast

Figure 2.2: FIFO broadcast
2.1.3 Atomic Broadcast

Broadcasts are often used for updating information that is replicated at several
sites. FIFO ordering may not be enough if different processors broadcast update
messages independently. In this situation, two update messages could be delivered
in different orders at different sites, leading to inconsistencies. This can be avoided
by using a stronger protocol that guarantees that all messages are delivered
in the same order everywhere, even if they were sent independently by different
processors. Such a protocol is called an atomic broadcast protocol, or ABCAST
for short. Figure 2.3 illustrates the behavior of an ABCAST. It shows two messages
broadcast independently by $p_1$ and $p_2$. Both messages are received in the same
order at $p_3$ and $p_4$ (first b, then a in this example).

Figure 2.3: Atomic broadcast

Figure 2.4: Two-phase implementation of atomic broadcast

There are several well-known techniques for implementing ABCAST in asynchronous
systems. Figure 2.4 illustrates a protocol due to Chang and Maxemchuk
[CM84] in which every message is broadcast in two phases. A processor wishing to
broadcast a message sends this message to one distinguished processor, say $p_1$ (first
phase); $p_1$ then forwards the message to its destinations by means of an FBCAST
(second phase). This way all broadcast messages are delivered in the order they
were received and forwarded by $p_1$. A different, more symmetric atomic broadcast
protocol due to Skeen is described in [BJ87b]. This method uses a three-phase
protocol as illustrated in Figure 2.5. Every processor maintains a message delivery
queue; when a broadcast message is received (phase one of the protocol), it is added
to the queue, but not yet delivered to the application program running at that site.
The recipient assigns a temporary “priority number” to the message and returns
this number to the sender of the broadcast (phase two). The recipient chooses this
number to be larger than any number assigned to messages currently queued or
previously delivered. The sender collects all priority numbers, computes their maximum
and sends this number to all destination sites (phase three). Every recipient
replaces the temporary priority number by the number just received, and reorders
the queue accordingly. A message is delivered when it has received its final priority
number and no messages with smaller priority number are in the queue.
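
The three-phase scheme can be sketched in Python as follows. This is a simplified, single-process illustration under stated assumptions, not the protocol of [BJ87b]: the class `Site`, its method names, and the use of the message identifier to break ties between equal priority numbers are all hypothetical choices made for this sketch.

```python
class Site:
    """Hypothetical destination site in the three-phase priority protocol."""

    def __init__(self):
        self.clock = 0        # largest priority number seen so far at this site
        self.queue = {}       # msg_id -> [priority, is_final]
        self.delivered = []   # message IDs in delivery order

    def propose(self, msg_id):
        """Phases one and two: queue the message and return a temporary priority."""
        self.clock += 1
        # Assumption: (number, msg_id) pairs make priorities unique.
        self.queue[msg_id] = [(self.clock, msg_id), False]
        return self.clock

    def finalize(self, msg_id, final):
        """Phase three: adopt the final priority, then deliver what is ready."""
        self.clock = max(self.clock, final[0])
        self.queue[msg_id] = [final, True]
        while self.queue:
            head = min(self.queue, key=lambda m: self.queue[m][0])
            if not self.queue[head][1]:
                break                      # smallest entry is still temporary
            self.delivered.append(head)
            del self.queue[head]


sites = [Site(), Site()]
# Two concurrent broadcasts "a" and "b" reach the two sites in opposite orders.
a0 = sites[0].propose("a")
b1 = sites[1].propose("b")
b0 = sites[0].propose("b")
a1 = sites[1].propose("a")
final_a = (max(a0, a1), "a")   # each sender takes the maximum of the proposals
final_b = (max(b0, b1), "b")
sites[0].finalize("b", final_b)
sites[0].finalize("a", final_a)
sites[1].finalize("a", final_a)
sites[1].finalize("b", final_b)
print(sites[0].delivered, sites[1].delivered)   # ['a', 'b'] ['a', 'b']
```

Although the two sites first see the messages in opposite orders, both deliver them in the same order, because a message is held back while any queued message with a smaller priority is still waiting for its final number.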

Atomic ordering makes the design of fault-tolerant distributed applications much 
easier, because it reduces the uncertainty caused by message delays and failures in 
the system. However, this benefit does not come without cost. The two protocols 
described above need two or three phases of communication before an ABCAST 
message can be delivered. It is not difficult to prove that in an asynchronous 
system (i.e., a system with unbounded message delays), any protocol that guarantees 
atomic ordering requires some messages to take at least two hops before they are 
delivered. Consider for example a system with two processors, $p_1$ and $p_2$. Processor
$p_1$ broadcasts a message a; at the same time $p_2$ broadcasts b. Both messages are
addressed to both processors. We claim that either message a needs at least two
hops (to $p_2$ and back to $p_1$) before it can be delivered at $p_1$, or message b needs two
hops. Assume the protocol delivers a at $p_1$ in one hop. This means that $p_1$ sends a
to $p_2$, but it delivers the message locally without waiting for a reply from $p_2$ (see
Figure 2.6). At the time of this local delivery, $p_1$ may not yet know that $p_2$ has sent
a broadcast. If the message b from $p_2$ to $p_1$ is delayed long enough, the protocol
will deliver a before b at $p_1$. Similarly, it is possible that at $p_2$, b will be delivered
before a. But that would violate atomic ordering.

Figure 2.6: Processor $p_1$ delivers message a locally without waiting for any messages
from $p_2$.

The situation is different if there is a known upper bound on message delays. In
such a system it is possible to maintain synchronized clocks [BD87,LMS85,ST87]. In
this case, atomic ordering can be achieved by a method based on timestamps. The
sender of a broadcast adds a timestamp to the message that shows the value of its
local clock when the message is sent out. Messages received at a destination site are
delivered to the application program in timestamp order; however, before a message
is delivered, the processor has to wait until it is certain that no more messages with
a lower timestamp will arrive. The amount of time to wait depends on the worst
case message delay and on how closely clocks are synchronized [CASD84]. The
disadvantage of this approach is that the delivery of every message is delayed by
the worst case message delay, which is often much larger than the average delay in
a two or three phase protocol.
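
The waiting rule of the timestamp method can be sketched in Python. This is a rough illustration under simplifying assumptions, not the protocol of [CASD84]: the class `TimestampReceiver`, the parameters `max_delay` and `max_skew`, and the choice to wait for their sum are hypothetical.

```python
import heapq


class TimestampReceiver:
    """Hypothetical receiver that delivers messages in timestamp order."""

    def __init__(self, max_delay, max_skew):
        # Assumed known bounds on message delay and on clock difference.
        self.wait = max_delay + max_skew
        self.held = []   # min-heap of (timestamp, sender, payload)

    def receive(self, timestamp, sender, payload):
        heapq.heappush(self.held, (timestamp, sender, payload))

    def deliverable(self, local_clock):
        """Release messages old enough that no earlier timestamp can still arrive."""
        ready = []
        while self.held and self.held[0][0] + self.wait < local_clock:
            ready.append(heapq.heappop(self.held)[2])
        return ready


r = TimestampReceiver(max_delay=10, max_skew=2)
r.receive(5, "p2", "b")
r.receive(3, "p1", "a")
print(r.deliverable(14))   # [] since neither message has waited long enough
print(r.deliverable(18))   # ['a', 'b'] in timestamp order
```

Every message is held for the full worst-case interval even when the network is fast, which is the cost noted in the paragraph above.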

2.1.4 Causal Broadcast 

Because of the inherent cost of atomic broadcast protocols it is natural to look for 
protocols that provide stronger ordering than FBCAST but are less expensive than 
ABCAST. The causal broadcast (CBCAST for short) is such a protocol. It is based 
on the idea of potential causality introduced by Lamport in [Lam78]. 

Figure 2.7: Potential causality

Figure 2.8: Causal broadcast

The flow of information during the execution of a distributed system can be
used to define a partial order on events occurring in the system. Such events are
the sending of a message, the receipt of a message, or a local event that only affects
a single processor. Figure 2.7 illustrates this. Events ei, e 4, en, and eu are send-events,
ej, e*, e 7, eg, eij, eu, and eis are receive-events, and ez, eg, and eio are local-events.
According to Lamport’s definition, all events that are connected by a path in
this diagram are potentially causally related. Such a path must follow the horizontal
lines (from left to right) or message arrows. For example, $e_{10}$ is potentially causally
related to $e_1$, because there is a path from $e_1$ to $e_{10}$ going through e?, e-t, and eg
(dotted line in the figure). This dependency is denoted by the symbol “→”, i.e.,
$e_1 \to e_{10}$. Events that are not connected by such a path are called concurrent. This
is denoted by the symbol “//”. For example, e$//e\4. The relation → is called
potential causality or information flow relation. The name “potential causality” has
the following explanation. In physics, the Principle of Causality says that a cause
has to precede its effect. Similarly, an event a in a distributed system can affect an
event b at some other processor only if there is a flow of information from a to b,
i.e., if a precedes b under →.
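
The relation just defined amounts to reachability in the event diagram, which a short Python sketch can make concrete. The event names and the tiny example graph below are hypothetical and do not reproduce Figure 2.7; `edges` maps each event to its immediate successors, either the next event on the same processor or the receipt of a message it sent.

```python
def happens_before(edges, a, b):
    """True if there is a nonempty path from event a to event b."""
    stack, seen = [a], set()
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def concurrent(edges, a, b):
    """Two distinct events are concurrent if neither precedes the other."""
    return a != b and not happens_before(edges, a, b) and not happens_before(edges, b, a)


# Hypothetical run: one processor sends a and then performs a local step;
# another processor receives a and then sends b.
edges = {
    "send_a": ["local_x", "recv_a"],
    "recv_a": ["send_b"],
}
print(happens_before(edges, "send_a", "send_b"))   # True
print(concurrent(edges, "local_x", "send_b"))      # True
```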

The ordering properties of a causal broadcast protocol are defined in terms of
this information flow relation. CBCAST guarantees that every processor receives
messages in an order that is consistent with →. That is, whenever two CBCAST
send events are related by →, the protocol ensures that the two messages are
received in the same order everywhere, namely the one given by →. For example,
in Figure 2.8, broadcasts $B_1$ and $B_3$ are potentially causally related ($B_1 \to B_3$,
represented by the dotted line in Figure 2.8). Consequently the message a is received
before c at both pi and $p_4$. Broadcasts $B_3$ and $B_4$, on the other hand, are concurrent
($B_3 // B_4$). Hence the two messages c and d may be received in different orders at
pi and $p_4$, as shown in this example. Notice that two events at the same processor
are never concurrent. Therefore a causal broadcast also respects FIFO ordering. For
example, in Figure 2.8, $B_1 \to B_4$; hence message a is received everywhere before d.

There are several ways of implementing causal broadcast that are very similar
to the use of sequence numbers in FBCAST protocols. A processor wishing to broadcast
a message adds some additional dependency information to the message before
sending it to its destinations. This technique is called “piggybacking”. The information
that is added to an outgoing message m consists of a list of other, previously
received messages that precede m under →. This form of CBCAST protocol is described
in detail by Birman and Joseph in [BJ87b]. In a system in which no failures
occur, it is sufficient to transmit only message IDs, instead of piggybacking whole
messages onto other messages [Pet87]. Using this piggybacking technique, causal
ordering can be achieved without multiple phases of message exchanges.
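
For the failure-free case, the message-ID variant can be sketched in Python as a receiver that holds a message back until everything it depends on has been delivered. This is an illustration only, not the protocol of [BJ87b] or [Pet87]; the class `CausalReceiver` is hypothetical, and the sketch assumes that every message named in a dependency list is also addressed to this receiver.

```python
class CausalReceiver:
    """Hypothetical receiver that delays messages until their dependencies are delivered."""

    def __init__(self):
        self.done = set()      # IDs of delivered messages
        self.delivered = []    # the same IDs in delivery order
        self.waiting = []      # (msg_id, deps) pairs not yet deliverable

    def receive(self, msg_id, deps):
        """deps is the piggybacked set of IDs of messages that precede msg_id."""
        self.waiting.append((msg_id, frozenset(deps)))
        progress = True
        while progress:        # one delivery may unblock further messages
            progress = False
            for entry in list(self.waiting):
                held_id, held_deps = entry
                if held_deps <= self.done:
                    self.waiting.remove(entry)
                    self.done.add(held_id)
                    self.delivered.append(held_id)
                    progress = True


r = CausalReceiver()
r.receive("c", {"a"})   # c causally follows a, so it is held back
r.receive("d", set())   # d has no dependencies and is delivered at once
r.receive("a", set())   # a is delivered, which releases c
print(r.delivered)      # ['d', 'a', 'c']
```

No extra rounds of messages are exchanged; the only cost is the dependency list carried by each message and the delay while a predecessor is still in transit.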

2.2 Reliability 

The broadcast protocols as we described them in the previous section only work 
correctly if no failures occur. Consider for example the BCAST protocol. If the 
sender crashes in the middle of the protocol, the message will reach only a subset 
of the destination sites. The situation is even worse for the three-phase ABCAST 
protocol. The failure of a single destination site can cause the protocol to block, 
preventing all other broadcasts from being received. 

So-called reliable broadcast protocols avoid this undesirable behavior. A reliable 
broadcast guarantees that every message sent will eventually be received by all 
operational destination sites, despite processor failures. We have to qualify this 
statement a little. Under certain failure patterns, no protocol can guarantee the 
delivery of a broadcast to all operational destinations. For example, the sender 
could crash before it actually sent out any messages. Even if the sender managed
to communicate with some other processor before it crashed, this other processor
could experience a failure before talking to anybody else. In general, a set of failures 
in an early stage of a broadcast protocol could wipe out all knowledge about the 
message to be sent. What we mean by reliable message delivery is that a message is 
delivered to all operational destinations unless the sender fails before the protocol 
has terminated. Furthermore, in case the sender fails at some time during the 
protocol, message delivery must be all-or-nothing. More precisely: 
If processor p sends a message m to a set D of destination sites, then
the system will eventually reach one of the following two states:

1. For all $q \in D$: q has received m or q has crashed.

2. Processor p has crashed, and for all $q \in D$: q has crashed or q will
never receive m.

This property is also called atomic message delivery. 
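
The two admissible final states can be written as a small Python predicate. The function name and the encoding of a final system state, a mapping from each destination to a pair `(received, crashed)`, are assumptions made for this sketch; "will never receive m" is represented simply as "has not received m in the final state".

```python
def atomic_delivery_holds(sender_crashed, destinations):
    """Check a final state against the two cases of the definition.

    destinations maps each site q in D to a pair (received, crashed).
    """
    states = list(destinations.values())
    # State 1: every destination has received m or has crashed.
    everyone = all(received or crashed for received, crashed in states)
    # State 2: the sender crashed and no surviving destination has m.
    nobody = sender_crashed and all(crashed or not received for received, crashed in states)
    return everyone or nobody


# The message reached q1; q2 crashed before receiving it.
print(atomic_delivery_holds(False, {"q1": (True, False), "q2": (False, True)}))    # True
# The sender crashed and only one of two surviving destinations has the message.
print(atomic_delivery_holds(True, {"q1": (True, False), "q2": (False, False)}))    # False
```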

We will now look at the different types of broadcast protocols introduced in the 
previous section (BCAST, FBCAST, CBCAST, ABCAST) and examine how they can be 
made reliable. 

2.2.1 Reliable BCAST, FBCAST, and CBCAST

The simplest reliable broadcast protocol uses a method called flooding or message 
diffusion. A processor wishing to broadcast a message sends it to all destination 
sites by means of an (unreliable) BCAST. Every processor that receives the message 
forwards it to all other destination sites using BCAST. This way every destination 
will receive multiple copies of the message (one from the sender and one from each 
other destination, if no failures occur); it forwards the message only the first time it 
is received and ignores all duplicates. This protocol achieves atomic delivery: Every 
processor that receives the message will eventually either succeed in forwarding it 
to all other operational destinations or it will fail. Therefore, eventually either 
all operational sites have received the message or all sites that ever received the 
message have crashed. 
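
The diffusion idea can be sketched in Python with direct method calls standing in for message sends. This is a toy, single-process illustration; the class `DiffusionSite`, the shared `network` dictionary, and the site names are hypothetical, and a real protocol would send asynchronously.

```python
class DiffusionSite:
    """Hypothetical site that forwards each message the first time it is received."""

    def __init__(self, name, network):
        self.name = name
        self.network = network   # shared dict: site name -> DiffusionSite
        self.seen = set()        # IDs of messages already received
        self.delivered = []
        self.crashed = False

    def receive(self, msg_id, payload, destinations):
        if self.crashed or msg_id in self.seen:
            return                               # crashed sites and duplicates do nothing
        self.seen.add(msg_id)
        self.delivered.append(payload)
        for dest in destinations:                # forward on first receipt only
            if dest != self.name:
                self.network[dest].receive(msg_id, payload, destinations)


network = {}
for name in ("q1", "q2", "q3"):
    network[name] = DiffusionSite(name, network)

# The sender reaches only q1 and then crashes; q1's forwarding completes the broadcast.
network["q1"].receive("m1", "hello", ["q1", "q2", "q3"])
print([site.delivered for site in network.values()])   # [['hello'], ['hello'], ['hello']]
```

The price of this robustness is message traffic: with no failures, each destination receives one copy from the sender and one from every other destination.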

By appending a set of sequence numbers to every message, FIFO ordering can 
be added to a diffusion protocol. This way we get a reliable FBCAST. 
The same technique can be used to make CBCAST reliable. The CBCAST protocols
we described in Section 2.1.4 work by piggybacking dependency information onto
the broadcast message to be sent out. The use of message diffusion to propagate this
message ensures that the original message contents as well as the dependency information
are delivered to all operational destination sites. This way causal ordering
can be preserved despite processor failures. For details see [BJ87b].

2.2.2 Reliable ABCAST

ABCAST is a form of consensus protocol, because atomic ordering requires all processors
to agree on a total order on all broadcasts. In [FLP85] Fischer, Lynch, and
Paterson show that it is impossible to achieve consensus in an asynchronous system
if failures occur. Consequently, it is not possible to implement reliable atomic
broadcast in such a system. The reason for this is that if no upper bound on
message delays is known, a processor failure is indistinguishable from very slow
communication. For example, consider a system with two processors $p_1$ and $p_2$. It
is not difficult to prove the impossibility result of [FLP85] for this example: Assume
processor $p_1$ broadcasts a message a with destinations $p_1$ and $p_2$. At the same time $p_2$
sends a message b, also addressed to both processors. Consider the following three
scenarios:

1. Processor $p_2$ crashes before sending any messages; $p_1$ does not fail. Then the
message a must eventually (say after some time interval $d_1$) be delivered at
$p_1$.

2. Processor $p_1$ crashes before sending any messages; $p_2$ does not fail. Then the
message b must eventually (say after some time interval $d_2$) be delivered at
$p_2$.

3. Neither of the two processors fails, but the communication network is very
slow; every message takes at least time $d = \max(d_1, d_2)$ before it is received.

Up to time d, processor $p_1$ cannot distinguish Scenario 3 from Scenario 1. In both cases it
has not yet received any messages from $p_2$, but it does not know if $p_2$ has crashed or
is still alive. Therefore, in Scenario 3, $p_1$ will deliver message a after time $d_1$, before
receiving any messages from $p_2$. Similarly b will be delivered at $p_2$ before $p_2$ receives
any messages from $p_1$. But then atomic ordering is violated in this scenario.

Therefore reliable atomic broadcast can only be achieved if we relax the assump- 
tions about the asynchrony of the system. There are two ways of doing this: 

1. Assume that failures can be detected. The ABCAST protocols described in
[CM84] and [BJ87b] achieve reliability under this assumption. If a processor
participating in an ABCAST protocol experiences a failure, some other
processor can take over and complete the protocol on behalf of the crashed
processor.

2. Assume there is an upper bound on message delays. In this case a reliable 
atomic broadcast can be implemented by combining a diffusion protocol with 
the method of timestamps to achieve atomic ordering [CASD84]. However, 
the amount of time that a processor has to wait before a message can be 
delivered to the application program increases with the number of expected 
failures. 

Notice that the second assumption implies the first. If message delays are bounded, 
failures can be detected by timeouts. In fact, most failure detection mechanisms in 
distributed systems rely on timeouts. 
2.3 Summary

We examined a variety of reliable broadcast protocols that differ in the form of
message ordering they provide.

• Atomic Broadcast (ABCAST):

All messages are delivered in the same order everywhere.

• Causal Broadcast (CBCAST):

The order in which messages are delivered is consistent with the information
flow relation between broadcast events.

• FIFO Broadcast (FBCAST):

Broadcasts by the same processor are delivered in the order sent.

• Unordered Broadcast (BCAST):

Messages are delivered in an arbitrary order.

The stronger the ordering property of the broadcast, the more costly it is to
implement. An atomic broadcast protocol requires at least two phases of message 
exchange, whereas CBCAST, FBCAST, and BCAST can be implemented as one-phase 
protocols. Furthermore, in an unreliable system in which processors may experience 
failures, ABCAST can only be implemented if failures are detectable or if an upper 
bound on message delays is known. 



Chapter 3 
Formal Model 


In this chapter we present a formalism based on events and histories for specify- 
ing problems in a distributed system. We introduce a model for a broadcast-based 
distributed implementation and give a definition for the correctness of an imple- 
mentation with respect to a problem specification. We illustrate our formalism by 
showing that every formal problem has an implementation based on atomic broad- 
casts. 


3.1 Formal Problem Specifications 

A program running in a distributed system consists of several components, each 
running at a different site, and interacting with each other by sending and receiving 
messages. A formal specification for such a program can be given in terms of
its input/output behavior. At each site there are clients (human users or other
programs) that interact with the distributed program. This interaction is typically 
described by a procedural interface. A client invokes an operation by passing the 
operation name and a set of parameters to the component of the distributed program 
residing at the local site. The program executes the operation, informs the client of
its completion, and possibly returns a value to the client. During the execution of the
operation the local component of the program may interact with remote components 
of the program. Figure 3.1 illustrates this view of a distributed program. 

We distinguish between the implementation of a distributed program and its 
behavior as observed by its clients. From a client’s point of view the program is 
a service that accepts requests from clients at different sites, executes each request 
and returns the result to the client. Figure 3.2 illustrates this view of a distributed 
program as a centralized service. We use this client view as the basis of our formal 
specifications. 

Definition 3.1

A formal event

$$e = A_i(x_1, \ldots, x_n) : v$$

denotes operation $A$ invoked by client $i$ with parameters $x_1, \ldots, x_n$, and returning
the value $v$. A formal history

$$H = (e_1, e_2, \ldots, e_m)$$

is a finite, totally ordered sequence of events.

A formal history describes the sequence of operations executed by the service and 
the values returned to the clients. A formal specification determines what constitutes 
correct behavior of the service, by defining which formal histories of the service are 
legal. Since we do not want to commit ourselves to any particular logical language for 
describing specifications, we simply identify a specification with the set of histories 
accepted by it. 
Figure 3.1: A client interacting with a distributed program

Definition 3.2

$H + e$ or $He$ denotes the history obtained by appending the event $e$ to $H$.

$HH'$ denotes the concatenation of $H$ and $H'$.

$H < H'$ means that $H$ is a prefix of $H'$.

$H|_i$ denotes the projection of $H$ onto client $i$, that is, the subsequence
of $H$ containing all operations invoked by client $i$.


Definition 3.3

A formal specification is a quadruple $\mathcal{S} = (n, I, V, S)$, where $n$ is the number
of clients, $I$ is a set of invocations of the form $A_i(x_1, \ldots, x_k)$, $V$ is a set of
return values, and $S$ is a set of histories. $S$ must satisfy the following two
properties:

$S$ is prefix-closed: $\forall H \in S: \forall H' < H: H' \in S$,

$S$ is complete and deterministic:

$$\forall H \in S: \forall \text{ invocation } a \in I: \exists \text{ unique return value } v \in V: H + a{:}v \in S.$$


At this point it is useful to give an example that illustrates our formalism. We will 
use this example throughout the rest of this dissertation. Consider the problem of 
managing a shared resource in a distributed system. The resource can be accessed 
from any site, but we want to ensure that at any given time only one site actually 
uses the resource. This problem can be solved by introducing the concept of a token 
that is associated with the resource. Only the site that is currently holding the 
token is allowed to access the resource. If the current token holder no longer needs
the resource, it may pass the token to some other site. We want to design a token
passing service that manages this token. This service would support the following
operations:

• query(): Boolean 

— returns TRUE if the caller is the current token holder. 

• pass(x: ClientId): ReturnCode 

— passes the token from the current token holder to client x. 

The PASS operation returns one of three values: OK, ERRORHOLDER (the 
caller is not the current token holder), or ERRORREQUEST (client x did not 
request the token). 

• request(): ReturnCode 

— requests the token.¹

The REQUEST operation returns one of three values: OK, ERRORHOLDER (the 
caller is already holding the token), or ERRORREQUEST (the caller has already 
requested the token). 

A complete formal specification is given in Appendix A. Here we will only list a 
few histories that illustrate this example. 

We assume that initially the token is held by client 1. Consider the history $H_1$:

$H_1 = Q_3:F,\ R_3:ok,\ P_1(3):ok,\ Q_3:T$

Q, R, P stand for QUERY, REQUEST, and PASS operations; T and F stand for the
return values TRUE and FALSE. Client 3 invokes a QUERY and finds out that it is
not holding the token. It then decides to request the token. Client 1 (the initial
token holder) passes the token to client 3, and consequently a QUERY by client 3
returns TRUE. We would consider this a legal history, i.e., $H_1 \in S$.

¹ The REQUEST operation is non-blocking. A client that needs the token would
invoke a REQUEST operation and then repeatedly issue a QUERY operation until it
returns TRUE.

$H_2 = Q_3:F,\ R_3:ok,\ P_1(3):ok,\ Q_3:F$

$H_2$ is an example of an illegal history: although the token has been passed to
client 3, the last QUERY returns FALSE. In this example the token passing service
would have returned the wrong value for the QUERY; therefore $H_2 \notin S$.

$H_3 = Q_3:F,\ R_3:ok,\ P_3(2):ok,\ Q_3:F$

This history is also illegal: client 3 is passing the token although it is not holding it.
The token passing service behaved incorrectly by returning OK for this operation.
It should instead have returned the value ERRORHOLDER, indicating an error:

$H_3' = Q_3:F,\ R_3:ok,\ P_3(2):ErrorHolder,\ Q_3:F$
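
The example histories can be checked mechanically with a small reference model. The following is only a sketch: the complete specification is in Appendix A and is not reproduced here, so the rules encoded in `execute` are assumptions (client 1 holds the token initially, a successful PASS clears the recipient's pending request, and the holder check precedes the request check). All identifiers are hypothetical.

```python
# An event is (operation, client, argument, return value).
# "Q" = QUERY, "R" = REQUEST, "P" = PASS. Names are illustrative only.

def execute(history, op, client, arg=None):
    """Return the value the service should return for `op` after `history`.

    Assumed rules (not taken from Appendix A): client 1 is the initial
    holder, and a successful PASS clears the recipient's request.
    """
    holder, requested = 1, set()
    for (o, c, a, v) in history:      # replay the history to rebuild the state
        if o == "R" and v == "ok":
            requested.add(c)
        elif o == "P" and v == "ok":
            holder = a
            requested.discard(a)
    if op == "Q":
        return client == holder
    if op == "R":
        if client == holder:
            return "ErrorHolder"
        return "ErrorRequest" if client in requested else "ok"
    if op == "P":
        if client != holder:
            return "ErrorHolder"
        return "ok" if arg in requested else "ErrorRequest"
    raise ValueError(f"unknown operation {op!r}")

def is_legal(history):
    """A history is legal iff every event carries the value `execute` computes
    from the prefix preceding it."""
    return all(execute(history[:k], o, c, a) == v
               for k, (o, c, a, v) in enumerate(history))

H1 = [("Q", 3, None, False), ("R", 3, None, "ok"), ("P", 1, 3, "ok"), ("Q", 3, None, True)]
H2 = [("Q", 3, None, False), ("R", 3, None, "ok"), ("P", 1, 3, "ok"), ("Q", 3, None, False)]
H3 = [("Q", 3, None, False), ("R", 3, None, "ok"), ("P", 3, 2, "ok"), ("Q", 3, None, False)]
H3_fixed = [("Q", 3, None, False), ("R", 3, None, "ok"),
            ("P", 3, 2, "ErrorHolder"), ("Q", 3, None, False)]
```

Under these assumed rules, `is_legal` accepts `H1` and `H3_fixed` and rejects `H2` and `H3`, matching the discussion above. Because `is_legal` checks every prefix in order, the set of accepted histories is prefix-closed by construction.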

In our formalism we make a number of implicit and explicit assumptions about 
the distributed service. 

1. A client invokes only one operation at a time and waits for the operation to 
complete before invoking the next one. 

2. We assume every operation can be executed as soon as it is invoked. In 
particular there are no operations that explicitly wait until another client has 
taken a certain action. In our formalism, operations with wait semantics must
be modeled by a “busy wait”. The token passing service, for example, does not
have an operation WaitForToken. Instead we provide the REQUEST and
QUERY operations. A client waiting for a token would periodically invoke a
QUERY until it returns TRUE. In Appendix B we show formally that any
operation with wait semantics can be modeled in this way. We chose this
model because it is simpler and entails no loss of generality.

3. We require specifications to be prefix-closed. This allows us to decide, at 
any time during the execution of a system, whether the service has behaved 
correctly so far. In other words, the correctness of an execution up to a given 
time does not depend on any future events. The prefix closure of $S$ also
makes it unnecessary to consider infinite histories. An infinite, legal history
is represented in $S$ by all its (finite) prefixes. However, because histories
are finite and specifications are prefix-closed, our formalism can only express
safety properties, not liveness properties [SA85].

4. We only consider deterministic specifications in which the value returned by 
an operation is determined completely by the parameters of the operations 
and by the previous history. Also, because specifications are complete, all
operations are total. In other words, clients are not restricted to invoking only
“legal” operations. Any specification can be made complete by specifying that
an operation should return a distinguished value ERROR when performed in a
state in which it would otherwise not be legal to execute the operation.

Our specifications differ from other formal specification methods. In particular,
we do not associate any state variables with a service. For example, consider a
service that provides two operations READ and WRITE. Instead of saying that a
WRITE(x=5) changes the value of some internal variable x, we specify the effect of
this operation by saying that the next operation READ(x) should return the value
5. Rather than specifying how an operation changes the internal state of a service,
we specify how the operation affects the result of future operations.

In some sense these two approaches to formal specification are equivalent. In
our formalism the current state of a service is represented by the history of all
operations executed so far. A new operation changes this state by appending an
event to the history. We chose the history-based approach because it does not
assume any specific internal representation for the state of the service.
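
The READ/WRITE example can be made concrete with a few lines of code. This is a minimal sketch under assumed conventions: events are hypothetical `(operation, variable, value)` triples, and a READ of a variable that was never written returns `None`.

```python
def read_value(history, var, default=None):
    """Value a READ(var) must return after `history`.

    There is no state variable: the answer is derived from the history
    alone by finding the most recent WRITE to `var`.
    """
    for op, name, value in reversed(history):
        if op == "write" and name == var:
            return value
    return default

history = []
history.append(("write", "x", 5))       # WRITE(x=5) only extends the history
result = read_value(history, "x")       # the next READ(x) must return this value
history.append(("read", "x", result))   # the READ itself becomes an event too
```

The only "state" here is the list `history`, which corresponds to the history-based view of the service described above.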

3.2 System Execution Model 

The main goal of this dissertation is to find out how different forms of broadcasts 
can be used to construct a solution to a problem that is specified in the formalism we 
introduced in the previous section. In this section we present a model for studying 
broadcast-based implementations of a service. 

In the most general terms, a distributed implementation of a service runs like 
this: 

• A client at processor i invokes an operation a. 

• Processor i starts an agreement protocol among all processors to decide on the
effect of the operation and its return value.

• When the protocol terminates, the result is returned to the client.

We will show in Section 3.5 that in order to obtain an implementation of any
formally specified problem, it is sufficient to have the agreement protocol establish a
global order on all the operations invoked by different clients in the system. An
atomic broadcast (ABCAST) does just that. An implementation based on ABCAST
would run like this:


• A client at processor i invokes an operation a.

• Processor i puts the operation (including its parameters) into a message and
broadcasts the message to all sites in the system (including itself).

• Other processors that receive this message update their local state. 

• When site i receives its own message, it also updates its state and at that 
time computes the result to be returned to the client. 

In Section 3.5 we will make this more precise and prove that such an implementation 
indeed gives a correct solution for any specification. In Chapter 4 we will then 
explore conditions under which it is possible to replace the ABCAST by a more 
efficient broadcast protocol that does not require the client to wait for a multi- 
phase agreement protocol to finish before it gets back its return value. In the rest 
of this section we define our model for broadcast-based implementations and give a 
criterion for their correctness with respect to a formal specification. 

3.2.1 Execution Histories 

An execution of the broadcast-based implementation outlined above can be described
by a picture like Figure 3.3. The horizontal lines show events happening at different
processors. To simplify the model we assume that there is only one client per
processor. There are two different types of execution events.²

1. Invocation events, which denote the invocation of an operation by the local
client. An invocation causes a message to be broadcast to all sites. These
messages are represented by the arrows in the figure.

2. Receive events, which denote the receipt of a message that was broadcast from
some other site. The tip of each arrow represents a receive event in the figure.

² Note that execution events are different from formal events as in Definition 3.1.
Definition 3.11 in Section 3.2.2 relates these two types of events.

$E_1$ = (2,1) c (3,1) (1,1) (3,2)

$E_2$ = a (2,1) (3,1) (1,1) (3,2)

$E_3$ = (2,1) b (1,1) (3,1) d (3,2)

$E_4$ = (2,1) (3,1) (1,1) (3,2)

Figure 3.3: An execution history
Consider, for example, the events at processor 2 in Figure 3.3. The first event is an 
operation “a” invoked by the client at that site. The invocation causes the processor 
to send out a broadcast, which is represented by the four arrows originating from 
the circle at “a”. The end of an arrow represents the receive event that arises 
when the broadcast message is delivered at another (or the same) site. A receive 
event is labeled by a pair of integers; the first one designates the processor that 
sent the broadcast, and the second one counts broadcasts sent from that processor. 
The second event at $p_2$ is the receive event (2,1), denoting the delivery of the
first broadcast from itself. There are three more receive events at $p_2$: (3,1), (1,1)
and (3,2). They denote the delivery of the first broadcast from $p_3$, the first one
from $p_1$, and the second broadcast from $p_3$. Below we describe such a graphical
representation of an execution in formal terms:

Definition 3.4 

An execution sequence $E = (E_1, \ldots, E_n)$ is a collection of totally ordered sets
of invocation and receive events,

$E \in [(I \cup \mathbb{N}^2)^*]^n$,

satisfying the condition:

$\forall\, \mathrm{inv}_E(i,j): \forall k: \exists$ unique receive event $(i,j) \in E_k$,

where $\mathrm{inv}_E(i,j)$ denotes the $j$'th invocation event in $E_i$.


We now introduce some terminology and notation:

Definition 3.5

$E[i,j]$   the $j$'th event in $E_i$.

$<_i$   the order of events in $E_i$, i.e., $E[i,j] <_i E[i,j']$ iff $j < j'$.

$\mathrm{inv}_E(i,j)$   the $j$'th invocation event in $E_i$.

$\mathrm{rcv}_E((i,j),k)$   the receive event $(i,j)$ in $E_k$.

$\mathrm{inum}_E(i,j)$   the sequence number of $\mathrm{inv}_E(i,j)$ in $E_i$,
i.e., if $E[i,l] = \mathrm{inv}_E(i,j)$ then $\mathrm{inum}_E(i,j) = l$.

$\mathrm{rnum}_E((i,j),k)$   the sequence number of $\mathrm{rcv}_E((i,j),k)$ in $E_k$,
i.e., if $E[k,l] = \mathrm{rcv}_E((i,j),k)$ then $\mathrm{rnum}_E((i,j),k) = l$.

$E - a$   (where $a = \mathrm{inv}_E(i,j)$ is an invocation event) the execution
history that is identical to $E$ except that $a$ and all its corresponding
receive events ($\mathrm{rcv}_E((i,j),k)$, for all $k$) are deleted.

Definition 3.6

Given an execution sequence $E$ we define the relation “$\xrightarrow{D}$” on the events in $E$:

$E[i,j] \xrightarrow{D} E[i,j+1]$ for all $i,j$.

$\mathrm{inv}_E(i,j) \xrightarrow{D} \mathrm{rcv}_E((i,j),k)$ for all $i,j,k$.

If $a \xrightarrow{D} b$ we say that $a$ directly precedes $b$.

An execution like the one in Figure 3.3 can be viewed as a directed graph that has
invocation and receive events as nodes and two types of edges: the horizontal lines
that connect events happening at the same processor and the arrows that represent
broadcast messages. The relation “$\xrightarrow{D}$” defines the edges in the execution graph.

For an execution sequence to make sense we need to add a few more restrictions
to Definition 3.4. For example, we need a condition that prevents messages from
flowing backwards in time.

Definition 3.7

An execution history $E = (E_1, \ldots, E_n)$ is an execution sequence satisfying the
following additional conditions:

• Sequential invocation: Clients invoke operations sequentially, i.e., a
client waits for the present invocation to complete before invoking a new
one:

$\forall i,j: \mathrm{rcv}_E((i,j),i) <_i \mathrm{inv}_E(i,j+1)$

• Monotonicity of time: The “$\xrightarrow{D}$” relation is acyclic (messages do not
flow backwards in time):

$\neg \exists\, e_1, \ldots, e_m \in E: e_1 \xrightarrow{D} e_2 \xrightarrow{D} \cdots \xrightarrow{D} e_m \xrightarrow{D} e_1$.

In addition we may specify the ordering properties of the broadcast protocol used 
by giving a message ordering axiom. For example, if we are interested in systems in 
which an atomic broadcast is used, we would specify an ABCAST-axiom that ensures 
that all messages are received in the same order everywhere: 

Definition 3.8 

ABCAST ordering axiom: 

$\forall i,j,i',j': \forall k,l$:

$\mathrm{rcv}_E((i,j),k) <_k \mathrm{rcv}_E((i',j'),k) \iff \mathrm{rcv}_E((i,j),l) <_l \mathrm{rcv}_E((i',j'),l)$.

An execution history that satisfies this ABCAST axiom in addition to the require- 
ments of Definition 3.7 would be called an “ABCAST execution history”. 
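
The sequential-invocation condition and the ABCAST ordering axiom can both be tested on a concrete execution. The sketch below uses the event lists of Figure 3.3 as they are transcribed in this text; the dictionary encoding (strings for invocation events, integer pairs for receive events) and all function names are assumptions made for illustration.

```python
from itertools import combinations

# Figure 3.3 as listed in the text: strings are invocation events,
# pairs (i, j) are receive events for the j'th broadcast of processor i.
E = {
    1: [(2, 1), "c", (3, 1), (1, 1), (3, 2)],
    2: ["a", (2, 1), (3, 1), (1, 1), (3, 2)],
    3: [(2, 1), "b", (1, 1), (3, 1), "d", (3, 2)],
    4: [(2, 1), (3, 1), (1, 1), (3, 2)],
}

def receives(events):
    """The receive events of one processor, in local order."""
    return [e for e in events if isinstance(e, tuple)]

def sequential_invocation(E):
    """Definition 3.7: a processor receives its own j'th broadcast before
    it makes invocation j+1."""
    for i, events in E.items():
        invoked = own_received = 0
        for e in events:
            if isinstance(e, tuple):
                if e[0] == i:
                    own_received += 1
            else:
                if own_received < invoked:
                    return False      # previous invocation not yet completed
                invoked += 1
    return True

def abcast_ordered(E):
    """Definition 3.8: every pair of broadcasts is received in the same
    relative order at every processor."""
    pos = {k: {m: n for n, m in enumerate(receives(ev))} for k, ev in E.items()}
    msgs = list(next(iter(pos.values())))
    for m1, m2 in combinations(msgs, 2):
        if len({p[m1] < p[m2] for p in pos.values()}) > 1:
            return False
    return True
```

With this encoding, the listed execution satisfies sequential invocation but not the ABCAST axiom: processor 3 receives (1,1) before (3,1), while the other processors receive them in the opposite order.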

3.2.2 Implementations

The previous section described a system execution only in terms of what operations 
clients invoke and when messages are sent and received. It does not specify what 
the contents of these messages are, how the recipient processes such a message, or 
what values are returned to the client as the result of an invocation. In other words, 
we need to specify what the program running at each site does. 

We do this by modeling each processor as a state machine that reacts to input
events (invocation events or receive events) by changing its state and generating an
output event (message to be broadcast or value returned to the client). This state
machine has two types of transition functions ($\phi$ and $\psi$), corresponding to the two
types of input events.

Definition 3.9 

An implementation is an 8-tuple $(n, I, V, M, Q, q_0, \Phi, \Psi)$, where

$n$   the number of processors in the system

$I$   the set of operations that can be invoked

$V$   the set of return values

$M$   the set of message values

$Q$   the set of states in which a processor can be

$q_0$   the initial state of all processors

$\Phi = (\phi_1, \ldots, \phi_n)$   invocation transition functions
$\phi_i : Q \times I \rightarrow Q \times M$

$\Psi = (\psi_1, \ldots, \psi_n)$   message receive transition functions
$\psi_i : Q \times M \rightarrow Q \times V$

The meaning of the transition functions $\phi_i$ and $\psi_i$ is as follows: When an operation
$a \in I$ is invoked by client $i$, processor $i$ changes its state from $q$ to $q'$ and broadcasts
the message $m$, where $(q', m) = \phi_i(q, a)$. When such a message is received at site
$j$, processor $j$ changes its state from $q$ to $q'$, where $(q', v) = \psi_j(q, m)$. The return
value for this operation is $v$; at the site where the operation was invoked this value
is passed to the client; at the other sites it is ignored. We will use superscripts
$s$, $m$, $v$ to refer to the state, message, and return value of $\phi$ and $\psi$, respectively, as
defined below:

if $\phi_i(q,a) = (q',m)$ then $\phi_i^s(q,a) = q'$, $\phi_i^m(q,a) = m$;
if $\psi_i(q,m) = (q',v)$ then $\psi_i^s(q,m) = q'$, $\psi_i^v(q,m) = v$.

Given such a formal implementation, we can take an execution history and deter- 
mine what messages are sent and what values are returned to the client. We start 
by giving a definition for computing the state of a processor after a particular event 
in an execution history: 

Definition 3.10 

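$$
\mathrm{state}_E[i,j] =
\begin{cases}
q_0 & \text{if } j = 0 \\
\phi_i^s(\mathrm{state}_E[i,j-1],\, a) & \text{if } E[i,j] = a \text{ is an invocation event} \\
\psi_i^s(\mathrm{state}_E[i,j-1],\, m) & \text{if } E[i,j] = (k,l) \text{ is a receive event, where} \\
 & m = \phi_k^m(\mathrm{state}_E[k,\, \mathrm{inum}_E(k,l)-1],\, \mathrm{inv}_E(k,l))
\end{cases}
$$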

Then $\mathrm{state}_E[i,j]$ defines the state of processor $i$ after $E[i,j]$, the $j$'th event at that
site. Note that the monotonicity requirement for execution histories (Definition 3.7)
prevents this definition from being circular. It is now straightforward to give
definitions that compute the messages being sent, the values returned to clients, the
formal events (invocation plus return value as in Definition 3.1) observed by clients,
as well as the sequence of formal events that any particular client observes:

Definition 3.11 

$\mathrm{msg}_E(i,j) = \phi_i^m(\mathrm{state}_E[i,\, \mathrm{inum}_E(i,j)-1],\, \mathrm{inv}_E(i,j))$

$\mathrm{val}_E(i,j) = \psi_i^v(\mathrm{state}_E[i,\, \mathrm{rnum}_E((i,j),i)-1],\, \mathrm{msg}_E(i,j))$

$\mathrm{event}_E(i,j) = a:v$, where $a = \mathrm{inv}_E(i,j)$, and $v = \mathrm{val}_E(i,j)$,

$H[E,i] = (\mathrm{event}_E(i,1), \mathrm{event}_E(i,2), \ldots, \mathrm{event}_E(i,m))$,
where $m$ is the number of invocation events in $E_i$.
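
Definitions 3.10 and 3.11 amount to an interpreter that replays an execution history against the transition functions. The following is a rough sketch of such an interpreter, not a construction from the text: the worklist loop, the dictionary encoding of $E$, and the toy transition functions `phi` and `psi` (a processor state that merely counts received messages) are all assumptions chosen for illustration.

```python
def client_histories(E, phi, psi, q0):
    """Compute H[E, i] for every processor i.

    E maps a processor to its event list (strings = invocations,
    pairs (k, l) = receive events). phi(i, q, a) -> (q', m) and
    psi(i, q, m) -> (q', v) play the roles of the transition functions.
    """
    state = {i: q0 for i in E}
    pos = {i: 0 for i in E}       # index of the next unprocessed event
    count = {i: 0 for i in E}     # invocations made so far by processor i
    sent = {}                     # (i, j) -> (operation, message)
    H = {i: [] for i in E}
    progress = True
    while progress:               # monotonicity guarantees eventual progress
        progress = False
        for i, events in E.items():
            while pos[i] < len(events):
                e = events[pos[i]]
                if isinstance(e, tuple):            # receive event (k, l)
                    if e not in sent:
                        break                       # not broadcast yet; retry later
                    op, m = sent[e]
                    state[i], v = psi(i, state[i], m)
                    if e[0] == i:                   # own broadcast: value goes to client
                        H[i].append((op, v))
                else:                               # invocation event
                    count[i] += 1
                    state[i], m = phi(i, state[i], e)
                    sent[(i, count[i])] = (e, m)
                pos[i] += 1
                progress = True
    return H

def phi(i, q, a):
    return q, a                   # state unchanged, broadcast the operation

def psi(i, q, m):
    return q + 1, q + 1           # count received messages, return the count

E = {
    1: [(2, 1), "c", (3, 1), (1, 1), (3, 2)],
    2: ["a", (2, 1), (3, 1), (1, 1), (3, 2)],
    3: [(2, 1), "b", (1, 1), (3, 1), "d", (3, 2)],
    4: [(2, 1), (3, 1), (1, 1), (3, 2)],
}
H = client_histories(E, phi, psi, 0)
```

For the execution of Figure 3.3 as transcribed, both `c` and `b` receive the value 3, because processors 1 and 3 see the corresponding broadcasts in different orders. This kind of disagreement is what an ordering axiom on the broadcast is meant to rule out.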


3.3 Implementation Correctness 

In Section 3.1 we defined formal problem specifications in terms of totally ordered 
histories which record a sequence of events executed by a centralized service. In 
our model of distributed implementations, however, there is no centralized service. 
Instead of one global history, we have a set of histories H [E, i] containing the subset 
of events observed by individual clients. We consider such a distributed implemen- 
tation correct if, to the clients, its behavior is indistinguishable from the behavior 
of a centralized service which performed the same set of operations. In particular, 
the implementation must satisfy the following condition: 

For every execution history $E$, it must be possible to merge all $H[E,i]$
into one legal, global history $H \in S$.

This ensures that clients cannot distinguish an execution of the distributed
implementation from a centralized one, because they all see part of a history that
would have been generated by a centralized server. This correctness condition is
very similar to the notion of serializability familiar from database theory [BG81,
Pap79].

This condition alone is not enough to ensure that the distributed implementation 
behaves as one would expect. We need to add a condition that says something 
about the relative order in which events invoked at different processors appear in 
the global history H. Consider the token passing example from Section 3.1. One 
could implement the token service in the following trivial way (recall that client 1 
is the initial token holder): 

• QUERY always returns FALSE if it is invoked by any client other than client 1. 

• If invoked by client 1, QUERY returns TRUE until client 1 passes the token.
After the event $P_1(j):ok$ a QUERY by client 1 always returns FALSE.

This implementation effectively “loses” the token after the first PASS operation,
because subsequent queries by any client return FALSE. However, notice that the
implementation satisfies the correctness condition stated above. An execution history
for this service might generate the following collection of formal histories observed
by clients:

$H[E,1] = Q_1:T,\ P_1(3):ok,\ Q_1:F,\ Q_1:F$
$H[E,2] = Q_2:F,\ Q_2:F,\ Q_2:F,\ \ldots,\ Q_2:F$
$H[E,3] = Q_3:F,\ Q_3:F,\ Q_3:F,\ \ldots,\ Q_3:F$

These three histories can easily be merged into a legal history:

$H = Q_1:T,\ Q_2:F,\ Q_3:F,\ \ldots,\ Q_2:F,\ Q_3:F,\ P_1(3):ok,\ Q_1:F,\ Q_1:F$

In other words, by putting the PASS event (and everything following it in $H[E,1]$)
at the very end of $H$, we always get a legal merged history.





To solve this problem we need to add a condition that prevents events from
being indefinitely deferred in the merged history $H$. An event observed by one
client should eventually get a “stable” place in $H$. We have to define what we mean
by “eventually”, since our execution model does not contain real time. Recall that
we are assuming an asynchronous distributed system in which messages may be 
delayed arbitrarily. It could be that the broadcast protocol initiated for the PASS 
operation terminates quickly at site 1, whereas due to message delays it finishes 
much later at sites 2 and 3. In this case we would consider the execution outlined 
above an acceptable behavior of a distributed implementation. Therefore we add 
the following condition: 

Once a broadcast message about an event a has been received at site i, 
the event becomes “stable” with respect to other events at site i. That 
is, when we construct a legal, global history H by merging individual 
processor histories, any event b that was invoked at site i after the 
message about a was received at i, must be ordered after a in H. 

In other words, we allow an event invoked at another site to be ignored only as long
as the message about it is still in transit. In the token passing example above, this
condition says that as soon as the message about the operation $\mathrm{PASS}_1(3)$ is received
at site 3, the next QUERY operation should return TRUE.

The next definition summarizes our two correctness conditions for distributed
implementations.





Definition 3.12 

$Y$ is a correct XBCAST-implementation of specification $\mathcal{S} = (n, I, V, S)$ iff:

$\forall$ XBCAST execution history $E$: $\exists H \in S$:

Correctness: $\forall i: H|_i = H[E,i]$

Liveness: $\forall i,j,k,l$:

$\mathrm{rcv}_E((i,j),k) <_k \mathrm{inv}_E(k,l) \Rightarrow \mathrm{event}_E(i,j) <_H \mathrm{event}_E(k,l)$

Here “XBCAST” stands for the type of broadcast used in the implementation. As
discussed above, the second condition (liveness) makes sure that as long as the
broadcast protocol guarantees that every message will eventually be delivered
everywhere, every operation invoked by a client will eventually be reflected in
operations at other sites. In other words, liveness of the broadcast protocol implies
liveness of the implementation.
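
The liveness condition of Definition 3.12 can be checked directly once a candidate global order is fixed. In the sketch below, a global history is represented only by the order of its events, each identified by the pair $(i,j)$ of its invocation; this encoding and the function name are assumptions. The check covers the liveness condition only, not the correctness condition on projections.

```python
E = {
    1: [(2, 1), "c", (3, 1), (1, 1), (3, 2)],
    2: ["a", (2, 1), (3, 1), (1, 1), (3, 2)],
    3: [(2, 1), "b", (1, 1), (3, 1), "d", (3, 2)],
    4: [(2, 1), (3, 1), (1, 1), (3, 2)],
}

def satisfies_liveness(E, H):
    """H lists event identifiers (i, j) in global order.

    For every processor k: any broadcast received at k before k's l'th
    invocation must be ordered before event (k, l) in H.
    """
    rank = {m: n for n, m in enumerate(H)}
    for k, events in E.items():
        seen = []                 # broadcasts already received at k
        count = 0                 # invocations made so far at k
        for e in events:
            if isinstance(e, tuple):
                seen.append(e)
            else:
                count += 1
                if any(rank[m] > rank[(k, count)] for m in seen):
                    return False
    return True
```

For the execution of Figure 3.3 as transcribed, the global orders (2,1), (1,1), (3,1), (3,2) and (2,1), (3,1), (1,1), (3,2) both pass, whereas any order that places (1,1) before (2,1) fails, because processor 1 had already received (2,1) when it invoked `c`.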

3.4 Externally Observed Histories 

Our model of execution histories does not contain any notion of real time. This raises 
the question: How does an execution history relate to what an external observer 
sees during the execution of an implementation? In an asynchronous system it 
does not make much sense to talk about time in absolute terms (e.g., milliseconds). 
However, we can consider the relative order — in real time — of events occurring 
during the execution of an implementation. Imagine an external observer who is 
able to monitor all nodes in a distributed system simultaneously. Such an observer 
would be able to determine a total order on all invocation and receive events at all 
sites. We call this sequence of execution events an external history, $E_{ext}$. We can
make the following statement about the relationship between the formal execution
history $E$ and the corresponding external history $E_{ext}$:

1. The formal execution history $E$ already determines a total order on all events
that happen at the same processor. Therefore, for all $i$, the events in $E_i$
appear in exactly the same order in $E_{ext}$.

2. The relative order of events at different processors is not determined by E , 
except that a receive event can never precede its corresponding invocation 
event, because messages do not flow backwards in time. 

We can summarize this in the following statement: 

The external history $E_{ext}$ can be any total order on the events in $E$
that is consistent with “$\xrightarrow{D}$”.

Given $E_{ext}$ we can extract an external formal history $H_{ext}$ recording all the formal
events during the execution of an implementation in the order they are seen by the
external observer. Notice that our correctness definition does not imply that $H_{ext}$
is always legal. However, it ensures that there exists a legal history that is similar
to $H_{ext}$, as defined below.

Definition 3.13 

$H$ is similar to $H'$ ($H \approx H'$) iff $\forall i: H|_i = H'|_i$

If clients communicate only through requests to the distributed service, then similar 
histories are indistinguishable to all clients. For a correct implementation it will 
always be the case that 

$\forall$ execution history $E$: $\exists H \in S: H_{ext} \approx H$

For example, an event $a$ may logically be ordered before $b$ in $H$ ($a <_H b$), but
physically $a$ could be observed after $b$, if $a$ and $b$ are events at different processors.
Hence the above statement just rephrases the requirement that a correct, distributed
implementation be indistinguishable from a centralized implementation in which the
externally observed history $H_{ext}$ is always legal.
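
Projection and similarity (Definitions 3.2 and 3.13) translate into a few lines of code. This is a minimal sketch with an assumed encoding: a formal event is a hypothetical `(client, operation, value)` triple, and the two example histories are invented for illustration.

```python
def project(H, i):
    """H|i: the subsequence of events in H that were invoked by client i."""
    return [e for e in H if e[0] == i]

def similar(H1, H2, clients):
    """H1 and H2 are similar iff every client's projection is identical."""
    return all(project(H1, i) == project(H2, i) for i in clients)

# Two hypothetical histories over the same events; only the relative
# order of events belonging to different clients differs.
H = [(3, "Q", False), (3, "R", "ok"), (1, "P(3)", "ok"), (3, "Q", True)]
H_ext = [(3, "Q", False), (1, "P(3)", "ok"), (3, "R", "ok"), (3, "Q", True)]

# Reordering events of the same client destroys similarity.
H_other = [(3, "R", "ok"), (3, "Q", False), (1, "P(3)", "ok"), (3, "Q", True)]
```

Here `H` and `H_ext` are similar, since no single client can tell them apart, while `H_other` is not similar to `H` because client 3 observes its own events in a different order.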

3.5 ABCAST Implementation

In Section 3.2 we claimed that the strong ordering properties of the atomic broad- 
cast provide enough synchronization between processors to solve any problem that 
can be specified in our formalism. In this section we will prove this claim. The 
purpose of this exercise is twofold. By showing that every problem has an ABCAST 
implementation, we demonstrate that our model of broadcast-based implementation 
is not too restrictive. Furthermore, in the next chapter we will use methods similar 
to the ones in this section to construct implementations based on more efficient 
protocols. 

Given a formal specification $\mathcal{S} = (n, I, V, S)$ we will construct an implementation
that satisfies our Definition 3.12 of correctness for all ABCAST execution histories.
Figure 3.4 describes this implementation informally in Pascal-like pseudocode. The 
implementation is essentially a variation of the state machine approach to replication 
as described in [Sch86]. The current system state is represented by the sequence of 
all operations executed so far (variable H in Figure 3.4). This state, as well as the
execution of client requests, is fully replicated. An operation invoked by a client is
broadcast to every site (including the one at which it was invoked) and is executed
everywhere when it is received. Executing an operation in a state $H$ simply means
adding a new event to $H$ after choosing a suitable return value $v$, such that the
new history is still legal. The requirement that specifications be deterministic and
complete ensures that there is always exactly one choice for such a value. This,
and the fact that ABCAST delivers all broadcasts in the same order at every site,
implies that all processors will agree on the same legal history $H$ of events that
have occurred in the system. This history will satisfy the correctness condition in
Definition 3.12. In order to prove this, we translate this implementation into our
formal execution model.

Processor i runs the following program:

    H := empty;
    loop
        wait for an invocation by the local client or the receipt of a broadcast;
        if client invoked operation a then
            ABCAST “a” to all processors;
        else if broadcast “a” was received from j then
            pick a value v, such that H + a:v ∈ S;
            H := H + a:v;
            if j = i then return value v to the client end if
        end if
    end loop

Figure 3.4: ABCAST implementation
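
The replicated loop of Figure 3.4 can be simulated in a few lines. The sketch below is not the construction used in the proof; it assumes that the atomic broadcast has already produced one global delivery order, and it uses a hypothetical single-register specification (`write` returns "ok", `read` returns the last written value) to play the role of "pick a value v such that H + a:v is legal". Processors are numbered from 0 here.

```python
def execute(history, op):
    """Return value of `op` after `history` for a hypothetical register:
    ("write", x) returns "ok"; ("read",) returns the last written value."""
    if op[0] == "write":
        return "ok"
    for past_op, _ in reversed(history):
        if past_op[0] == "write":
            return past_op[1]
    return None

def run_abcast(n, deliveries, execute):
    """Deliver every broadcast to all n replicas in one global order.

    `deliveries` is a list of (sender, operation) pairs in ABCAST order.
    Each replica appends the event to its local history; only the sender's
    replica hands the return value back to its client.
    """
    histories = [[] for _ in range(n)]
    returned = []
    for sender, op in deliveries:
        for i in range(n):
            v = execute(histories[i], op)    # the unique legal return value
            histories[i].append((op, v))
            if i == sender:
                returned.append((sender, op, v))
    return histories, returned

deliveries = [(0, ("write", 5)), (1, ("read",)), (2, ("write", 7)), (1, ("read",))]
histories, returned = run_abcast(3, deliveries, execute)
```

Because every replica applies the same deterministic function to the same sequence of operations, all local histories end up identical, which is the informal content of the agreement claim above.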

Because specifications are deterministic and complete, we can define an execution
function $X_S : S \times I \rightarrow V$ such that

$\forall H \in S, a \in I: v = X_S(H,a) \Rightarrow H + a:v \in S$,

or in words: $X_S$ computes the correct return value of operation $a$ invoked in state
$H$. Given a specification $\mathcal{S} = (n, I, V, S)$ we define the implementation $Y_S$:

$Y_S = (n, I, V, I, (I \times V)^*, \emptyset, \Phi, \Psi)$, i.e., $M = I$, $Q = (I \times V)^*$, $q_0 = \emptyset$,

where the transition functions $\Phi$, $\Psi$ are defined as follows: When operation $a$ is
invoked at processor $i$ in state $H$, it does not change its state but broadcasts “a”:

$\phi_i(H, a) = (H, a)$.

When processor $i$ receives a message containing operation $a$ it executes the operation
by adding the event $a:v$ to its history $H$; the value $v$ is returned to the client:

$\psi_i(H, a) = (H + a:v,\ v)$, where $v = X_S(H,a)$.

Lemma 3.1 

For every ABCAST execution history $E$ of $Y_S$: the final state of all processors is
identical.

Proof: A processor state only changes when a message is received ($\phi_i^s$ is the
identity on states). Because of the ABCAST ordering axiom (Definition 3.8), all $E_i$
contain the same sequence of receive events. Since all processors start in the same
state $q_0 = \emptyset$ and the transition functions are identical, all processors will end up
in the same final state. □

Theorem 3.1 

$Y_S$ is a correct ABCAST implementation of specification $\mathcal{S}$.

Proof: We show that for every execution $E$, the history $H_f$ given by the final
state of processors in $E$ is legal ($H_f \in S$) and satisfies the correctness and liveness
conditions of Definition 3.12. We do this by induction on the number of events in
$H_f$.

The base case, $H_f = \emptyset$, is trivially satisfied because an empty history is always
legal. This follows from the fact that specifications are prefix-closed.

For the induction step, consider an execution history $E$ such that $H_f$ is non-empty.
Let $\mathrm{rcv}_E((i,j),1)$ be the last receive event in $E_1$. Because of the ABCAST ordering,
$\mathrm{rcv}_E((i,j),k)$ is the last event in $E_k$ for all $k$. Let $E' = E - \mathrm{inv}_E(i,j)$ (i.e., $E$ with
$\mathrm{inv}_E(i,j)$ and $\mathrm{rcv}_E((i,j),k)$, for all $k$, deleted). Let $H'_f$ be the history given by the
final state of processors in $E'$.

We first show that $H_f$ is legal. By induction hypothesis $H'_f \in S$. Furthermore,
$H_f = H'_f + a:v$, where $a = \mathrm{inv}_E(i,j)$ and

$v = \mathrm{val}_E(i,j) = \psi_i^v(H'_f, a) = X_S(H'_f, a)$.

Therefore $H_f = H'_f + a:v \in S$ follows from the definition of $X_S$. We complete
the proof by showing that $H_f$ satisfies the correctness and liveness conditions of
Definition 3.12.

Correctness: We have to show that $H_f|_k = H[E,k]$ for all $k$. By induction hypothesis
$H'_f|_k = H[E',k]$. For the invoking client $i$ we therefore have

$H_f|_i = (H'_f + a:v)|_i = H'_f|_i + a:v = H[E',i] + a:v = H[E,i]$.

For every other client $k \neq i$ the new event does not appear in the projection, so
$H_f|_k = H'_f|_k = H[E',k] = H[E,k]$.

Liveness: Let $\mathrm{rcv}_E((l,m),k) <_k \mathrm{inv}_E(l',m')$ (so $k = l'$). We have to show that
$\mathrm{event}_E(l,m) < \mathrm{event}_E(l',m')$ in $H_f$.
Case 1, $(l',m') = (i,j)$: In this case $\mathrm{event}_E(l',m') = \mathrm{event}_E(i,j)$ is the last event
in $H_f$, and therefore $\mathrm{event}_E(l,m) < \mathrm{event}_E(l',m')$ in $H_f$.
Case 2, $(l',m') \neq (i,j)$: In this case the two events $\mathrm{event}_E(l,m)$ and $\mathrm{event}_E(l',m')$
are both in $H'_f$, and by induction hypothesis $\mathrm{event}_E(l,m) < \mathrm{event}_E(l',m')$ in $H'_f$,
hence also in $H_f$. □

3.6 Summary 

We presented two different models for a distributed program: one for the formal
specification of the program and one for its implementation.

1. We modeled the behavior of a distributed program as a service that executes 
requests on behalf of clients. An execution of such a service is described 
as a sequence of events, in which each event denotes the execution of one 
client request. We call such an event sequence a formal history. A formal
specification for such a service is a set $S$ that lists all possible legal histories.

2. Our implementation model describes a system as a collection of state ma- 
chines. Each processor reacts to input events (invocation events or receive 
events) by changing its state and generating an output event (message to be 
broadcast or value returned to the client).

We then defined the correctness of an implementation with respect to a formal
specification in such a way that clients cannot distinguish the behavior of the distributed
implementation from that of a central server.

To illustrate our formalism and to show that our implementation model is not
too restrictive, we demonstrated that any formal specification has an ABCAST
implementation.


Chapter 4 

Asynchronous Implementations 


In the previous chapter we saw that every formal specification has an ABCAST
implementation. In this chapter we address the main questions of this dissertation:
Can we construct more efficient implementations by using broadcast protocols that 
provide a weaker form of ordering? For which kinds of problems will this be suc- 
cessful? 

We start by considering implementations based on a causal broadcast (CBCAST).
We give a necessary and sufficient condition for a specification to be implementable 
with this type of broadcast. If such an implementation exists it can be expressed in 
a standard form. Finally, we show that a CBCAST implementation can be translated 
into an implementation based on FBCAST or even unordered broadcasts. 

The implementations we construct this way can be characterized as follows:
When a client invokes an operation, the return value can always be computed
immediately from local information. This way the client need not wait for messages
to arrive at other sites or for replies to make it back; information is propagated
asynchronously to other sites. Therefore we call this type of implementation
asynchronous.


4.1 Causality and Timestamps 

In Chapter 2 we introduced the idea of potential causality [Lam78]. In our execution
model we can define this relation as follows.

Definition 4.1 

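The information flow relation "$\rightarrow$" on the events in an execution history $E$ is
the transitive, reflexive closure of the relation that Definition 3.7 requires to be acyclic.

Two events $a$, $b$ that are not related under "$\rightarrow$" are called concurrent ($a \parallel b$).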

If we interpret an execution history $E$ as a directed graph (the nodes are the
invocation and receive events in $E$; the edges are given by the relation whose closure
is "$\rightarrow$"), then $a \rightarrow b$ if and only if there is a path from $a$ to $b$ in this graph.
Because the edge relation is acyclic (Definition 3.7), the information flow relation
"$\rightarrow$" defines a partial order on the events in an execution history.¹ The intuitive
meaning of this relation is the following. An event $a$ can affect some other event $b$
only if it precedes $b$ in this partial order. In particular, the state of a processor
after an event $b$ depends only on events that precede $b$ under "$\rightarrow$". This fact is
expressed in the next lemma.

¹ However, note that we define "$\rightarrow$" to be reflexive ($\forall e \in E: e \rightarrow e$), contrary
to the usual definition of a partial order. This notational convenience makes the
later definitions simpler.

Definition 4.2

$E'$ is a prefix of $E$ ($E' < E$) iff

(i) $E' \subseteq E$

(ii) $\forall i: \forall a, b \in E'_i: a <_i b$ in $E'$ $\Leftrightarrow$ $a <_i b$ in $E$

(iii) $\forall a, b \in E: b \in E' \wedge a \rightarrow b \Rightarrow a \in E'$

For an event $a \in E$, we define $E[a]$ (the prefix at $a$) as follows:

(i) $E[a] < E$

(ii) $\forall$ invocation events $a' \in E$: $a' \in E[a] \Leftrightarrow a' \rightarrow a$
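
The prefix $E[a]$ is simply the set of events from which $a$ can be reached. The following is a minimal sketch, assuming the history is given as a set of direct edges (local order and broadcast-to-receipt pairs); the event names and the helper `prefix_at` are hypothetical and not part of the formalism above.

```python
from collections import defaultdict

def prefix_at(edges, a):
    """Return E[a]: every event e with e -> a (reflexive, transitive)."""
    preds = defaultdict(set)
    for x, y in edges:
        preds[y].add(x)
    seen, stack = {a}, [a]          # a itself belongs to E[a] (reflexivity)
    while stack:
        cur = stack.pop()
        for p in preds[cur]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

# Hypothetical history: inv1 is received at processor 2 before inv2 is
# invoked there; inv3 is concurrent with both.
edges = {("inv1", "rcv1@2"), ("rcv1@2", "inv2")}
print(sorted(prefix_at(edges, "inv2")))
print(sorted(prefix_at(edges, "inv3")))
```

By Lemma 4.1 below, only the events returned by `prefix_at` can influence the outcome of `a`.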

Lemma 4.1 

Let $E$ and $E'$ be two execution histories and $a = E[i,j] = E'[i,j]$ be an event
occurring in both histories ($a \in E \cap E'$).

If $E'[a] = E[a]$ then

(i) $\mathrm{stat}_{E'}[i,j] = \mathrm{stat}_E[i,j]$

(ii) $\mathrm{msg}_{E'}(i,l) = \mathrm{msg}_E(i,l)$ if $a = \mathrm{inv}_E(i,l)$ is an invocation event
     $\mathrm{val}_{E'}(i,l) = \mathrm{val}_E(i,l)$ if $a = \mathrm{rcv}_E((i,l),i)$ is a local receive event

In other words, if we take an execution history $E$ and modify it into a history $E'$ in
such a way that events preceding $a$ under "$\rightarrow$" are unchanged (i.e., $E[a] = E'[a]$),
then the state of processor $i$ after event $a$, as well as the message sent or the value
returned to the client, will not be affected by these modifications. Hence the lemma
tells us that $E[a]$ contains exactly those events in $E$ that have an effect on the
outcome of $a$.

Proof: By induction on the number of events in $E[a]$. We write $\phi_i^{s}$, $\phi_i^{m}$ for
the state and message components of $\phi_i$, and $\psi_i^{s}$, $\psi_i^{v}$ for the state and value
components of $\psi_i$. In the base case $E[a]$ contains only a single event, namely $a$.
Then $a$ must be the first event in $E_i$, i.e., $j = 1$; otherwise the event preceding $a$
at $i$ would also be in $E[a]$. Furthermore, $a$ cannot be a receive event; otherwise the
corresponding invocation event would precede $a$ under "$\rightarrow$" and would be in $E[a]$.
Therefore by Definition 3.10

$\mathrm{stat}_E[i,j] = \mathrm{stat}_E[i,1] = \phi_i^{s}(\mathrm{stat}_E[i,0], a) = \phi_i^{s}(q_0, a)$.

Because $\mathrm{stat}_E[i,0] = \mathrm{stat}_{E'}[i,0] = q_0$ we have $\mathrm{stat}_{E'}[i,1] = \mathrm{stat}_E[i,1]$. Furthermore
by Definition 3.11

$\mathrm{msg}_E(i,1) = \phi_i^{m}(\mathrm{stat}_E[i,0], a) = \phi_i^{m}(q_0, a) = \mathrm{msg}_{E'}(i,1)$.

For the induction step consider $E[a]$ with more than one event. Let $s = \mathrm{stat}_E[i,j-1]$
and $s' = \mathrm{stat}_{E'}[i,j-1]$. If $j = 1$ ($a$ is the first event at $p_i$) then $s = s' = q_0$.
Otherwise let $b = E[i,j-1]$ and $b' = E'[i,j-1]$ be the events preceding $a$ in $E_i$ and $E'_i$.
Then $b \rightarrow a$ and $b' \rightarrow a$; hence $b \in E[a]$, $b' \in E'[a]$. Because $E'[a] = E[a]$ we have
$b' = b$ and $E'[b] = E[b] < E[a]$. By induction hypothesis the state of $p_i$ after $b$ is
the same in $E$ and $E'$; hence again $s = s'$.

If $a$ is an invocation event then

$\mathrm{stat}_E[i,j] = \phi_i^{s}(s,a) = \phi_i^{s}(s',a) = \mathrm{stat}_{E'}[i,j]$
$\mathrm{msg}_E(i,j) = \phi_i^{m}(s,a) = \phi_i^{m}(s',a) = \mathrm{msg}_{E'}(i,j)$.

Otherwise $a = \mathrm{rcv}_E((k,l),i)$ is a receive event. Let $c = \mathrm{inv}_E(k,l)$ and $d = \mathrm{inv}_{E'}(k,l)$
be the corresponding invocation events in $E$ and $E'$. Then $c \rightarrow a$ and $d \rightarrow a$. It
follows that $c, d \in E[a] = E'[a]$, and therefore $c = d$ and $E'[c] = E[c]$. By induction
hypothesis $\mathrm{msg}_E(k,l) = \mathrm{msg}_{E'}(k,l)$. Therefore

$\mathrm{stat}_E[i,j] = \psi_i^{s}(s, \mathrm{msg}_E(k,l)) = \psi_i^{s}(s', \mathrm{msg}_{E'}(k,l)) = \mathrm{stat}_{E'}[i,j]$,

and if $k = i$ (a local message was received) then

$\mathrm{val}_E(i,l) = \psi_i^{v}(s, \mathrm{msg}_E(i,l)) = \psi_i^{v}(s', \mathrm{msg}_{E'}(i,l)) = \mathrm{val}_{E'}(i,l)$.

□

Corollary 4.1 

Let $E' < E$ be a prefix of the execution history $E$. Then
$\forall a = E'[i,j] \in E'$:

(i) $\mathrm{stat}_{E'}[i,j] = \mathrm{stat}_E[i,j]$

(ii) $\mathrm{msg}_{E'}(i,l) = \mathrm{msg}_E(i,l)$ if $a = \mathrm{inv}_E(i,l)$ is an invocation event
     $\mathrm{val}_{E'}(i,l) = \mathrm{val}_E(i,l)$ if $a = \mathrm{rcv}_E((i,l),i)$ is a local receive event

Proof: $E' < E$ implies $E'[a] = E[a]$ for all $a \in E'$. □

In Section 3.2.2 we defined $H[E,i]$ to be the sequence of formal events observed
by a particular client in an execution $E$ of an implementation $Y$. The relation "$\rightarrow$"
induces a partial order on the formal events in the union of all $H[E,i]$. We call
this partially ordered set of formal events derived from $E$ and $Y$ a run. It is defined
formally below:

Definition 4.3 

Given an execution history $E$ and an implementation $Y$ we define the run
$R_Y(E)$ to be the set of formal events given by $E$, partially ordered by "$\rightarrow$":

$R_Y(E) = \{\mathrm{event}_E(i,j) \mid \text{for all } i, j\}$

with the partial order on $R_Y(E)$ defined as

$\mathrm{event}_E(i,j) \rightarrow \mathrm{event}_E(l,m) \Leftrightarrow \mathrm{inv}_E(i,j) \rightarrow \mathrm{inv}_E(l,m)$

As in the case of an execution history we use the notation $\mathrm{event}_R(i,j)$ to denote the
$j$'th event at processor $i$ in a run $R$.

In [Lam78] Lamport introduced logical timestamps, integers assigned to each
event in such a way that if all events are ordered by their timestamp this order is
consistent with "$\rightarrow$". We can generalize this idea to timestamps which are vectors
of integers [Sch85].² Such timestamps are useful for keeping track of the partial
order of events as the system executes.

Definition 4.4 

A timestamp $t_e$ for an event $e = \mathrm{event}_R(i,j) \in R$ is a vector of $n$ integers with
the following meaning:

$t_e[k] = \|\{\mathrm{event}_R(k,l) \in R \mid \mathrm{event}_R(k,l) \rightarrow e\}\|$

i.e., $t_e[k]$ is the number of events at $k$ that precede $e$ in the partial order.

The following lemma states that given only the timestamps of two events in a run
one can deduce their order under "$\rightarrow$".

Lemma 4.2

Let $a = \mathrm{event}_R(i,j)$ and $b = \mathrm{event}_R(k,l)$ be two events in a run $R$, and let $t_a$
and $t_b$ be their timestamps. Then

$a \rightarrow b \Leftrightarrow t_a[i] \le t_b[i]$

Proof: $a = \mathrm{event}_R(i,j)$ is the $j$'th event invoked at site $i$. Therefore
$\mathrm{event}_R(i,j') \rightarrow a$ iff $j' \le j$, and hence $t_a[i] = j$.

If $a \rightarrow b$ then by transitivity of "$\rightarrow$" $\mathrm{event}_R(i,j') \rightarrow b$ for all $j' \le j$. Hence $t_b[i] \ge j$.

Conversely, let $t_b[i] \ge j$. Then there are at least $j$ events at processor $i$ that
precede $b$ under "$\rightarrow$". In particular, there must be an event $\mathrm{event}_R(i,j') \rightarrow b$ for
some $j' \ge j$, which implies that $a \rightarrow \mathrm{event}_R(i,j')$, and by transitivity $a \rightarrow b$. □

² The idea of vector timestamps was developed independently by Ladin and
Liskov [LL86].

We will use these timestamps in the implementations we construct in the next 
section. 
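
The timestamp definition and the comparison rule of Lemma 4.2 can be tried out on a small run. This is a sketch under stated assumptions: sites are numbered from 0, events are named by hypothetical strings, and the helpers `timestamps` and `precedes` are illustrative names rather than part of the original text.

```python
def timestamps(n, events, edges):
    """events: name -> (site, index), sites 0..n-1, index 1-based.
    edges: pairs (a, b) generating a -> b (local order and message flow).
    Returns (ts, preds): ts[e][k] counts events at site k preceding e."""
    preds = {e: {e} for e in events}      # reflexive closure
    changed = True
    while changed:                        # naive transitive closure
        changed = False
        for a, b in edges:
            new = preds[a] - preds[b]
            if new:
                preds[b] |= new
                changed = True
    ts = {}
    for e in events:
        t = [0] * n
        for p in preds[e]:
            t[events[p][0]] += 1
        ts[e] = t
    return ts, preds

def precedes(a, b, events, ts):
    """Lemma 4.2: a -> b iff t_a[i] <= t_b[i], where i is the site of a."""
    i = events[a][0]
    return ts[a][i] <= ts[b][i]

# Site 0 invokes a1 then a2; site 1 invokes b1 after receiving a1.
events = {"a1": (0, 1), "a2": (0, 2), "b1": (1, 1)}
edges = {("a1", "a2"), ("a1", "b1")}
ts, preds = timestamps(2, events, edges)
print(ts)
print(precedes("a1", "b1", events, ts), precedes("a2", "b1", events, ts))
```

In this run `a2` and `b1` are concurrent, so the comparison fails in both directions, while `a1` precedes both.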


4.2 CBCAST Implementation

The causal broadcast protocol described by Birman and Joseph in [BJ87b] is a
protocol that preserves the information flow relation between events, i.e., whenever two
broadcasts $b_1$, $b_2$ are related under "$\rightarrow$" ($b_1 \rightarrow b_2$), the protocol guarantees that $b_1$
will be received before $b_2$ everywhere. Concurrent broadcasts may be received in
different orders at different sites. In our formalism we define the ordering properties
of CBCAST by the following axiom:

Definition 4.5 

CBCAST ordering axiom: 

(i) Causal ordering: 

$\mathrm{inv}_E(i,j) \rightarrow \mathrm{inv}_E(l,m) \Rightarrow \forall k: \mathrm{rcv}_E((i,j),k) <_k \mathrm{rcv}_E((l,m),k)$

(ii) Immediate local delivery:

$\forall i,j: \neg\exists a: \mathrm{inv}_E(i,j) <_i a <_i \mathrm{rcv}_E((i,j),i)$
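
Part (i) of the axiom can be checked mechanically on a recorded execution. The sketch below assumes each processor's receive order is available as a list of broadcast identifiers and that a predicate for "$\rightarrow$" between invocations is supplied; both the data layout and the name `causally_ordered` are hypothetical. Part (ii), immediate local delivery, is not checked here.

```python
def causally_ordered(deliveries, precedes):
    """deliveries: processor -> list of broadcast ids in receive order.
    precedes(a, b): True when inv(a) -> inv(b) for distinct a, b.
    Returns True iff no processor receives b before a although a -> b."""
    for order in deliveries.values():
        pos = {m: k for k, m in enumerate(order)}
        for a in order:
            for b in order:
                if a != b and precedes(a, b) and pos[a] > pos[b]:
                    return False
    return True

# b1 -> b2, while c is concurrent with both.
precedes = lambda a, b: (a, b) == ("b1", "b2")
ok = {0: ["b1", "b2", "c"], 1: ["c", "b1", "b2"]}
bad = {0: ["b2", "b1", "c"]}
print(causally_ordered(ok, precedes), causally_ordered(bad, precedes))
```

The two processors in `ok` see the concurrent broadcast `c` at different positions, which the axiom permits.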

How can we use such a broadcast to construct an implementation for a given
specification $\mathcal{S}$? Our plan is to take the ABCAST implementation from the previous
chapter, replace the ABCAST by a CBCAST, and determine under which condition
the implementation will still be correct. In order for this to work it is necessary to
make two more modifications to the ABCAST implementation:

• Recall that the correctness of the ABCAST implementation depended on the
fact that all processors agreed on the order of events and therefore constructed
the same legal history $H$. If we use a CBCAST instead of ABCAST this will no
longer be the case. However, using the timestamps introduced in the previous
section it is possible for all processors to keep track of and agree upon the
partial order of events during the execution of the system. In other words, we
replace the variable $H$ in Figure 3.4 by a variable $R$, containing a partially
ordered set of events (a run).

• Now, in order to execute an operation correctly it is necessary to relate these
runs to globally ordered histories as they appear in a formal specification.
For this purpose we introduce a function that maps partially ordered sets of
events to totally ordered histories. We call this a linearization operator; it is
defined formally below.

Definition 4.6 

A linearization operator, $\mathrm{LIN}: \{\text{runs}\} \rightarrow \{\text{histories}\} \cup \{\bot\}$, is a partial function⁴
from runs to histories, such that:

(i) $\mathrm{LIN}(\emptyset) = \emptyset$

(ii) $\forall R$: if $H = \mathrm{LIN}(R) \ne \bot$ then

$\forall a: a \in H \Leftrightarrow a \in R$
$\forall a, b \in R: a \rightarrow b \Rightarrow a <_H b$

⁴ The symbol $\bot$ denotes an undefined return value, i.e., $\mathrm{LIN}(R) = \bot$ means LIN
is undefined on $R$.

Processor i runs the following program:

    R := empty;
    t := [0, 0, ..., 0];
    t[i] := 1;

    loop
        wait for an invocation by the local client or the receipt of a broadcast;
        if client invoked operation a then
            pick a value v, such that LIN(R + a:v) ∈ S;
            CBCAST (a:v, t) to all processors;
            return v to the client;
        else if broadcast (a:v, s) was received from j then
            R := R ⊕_s a:v;
            t[j] := t[j] + 1;
        end if
    end loop

Figure 4.1: CBCAST implementation

For every run $R$ on which $\mathrm{LIN}(R)$ is defined, LIN linearizes the events in $R$ in a
way that preserves the partial order "$\rightarrow$" in $R$. With these changes, our CBCAST
implementation, in informal pseudocode, looks like Figure 4.1.
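
The definition of a linearization operator only requires that the resulting history contain exactly the events of the run and respect "$\rightarrow$". One simple candidate is a deterministic topological sort. The sketch below is an illustration under assumptions: runs are given as a set of event names plus direct edges, `None` stands in for $\bot$, and the tie-breaking rule (alphabetical) is arbitrary. Whether a particular candidate is suitable for a given specification is a separate question.

```python
import heapq

def lin(events, edges):
    """Kahn's algorithm with ties broken by event name.
    Returns a list (the history) or None if the input is not acyclic."""
    indeg = {e: 0 for e in events}
    succs = {e: [] for e in events}
    for a, b in edges:
        succs[a].append(b)
        indeg[b] += 1
    ready = [e for e in events if indeg[e] == 0]
    heapq.heapify(ready)
    history = []
    while ready:
        e = heapq.heappop(ready)
        history.append(e)
        for s in succs[e]:
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(ready, s)
    return history if len(history) == len(indeg) else None

def preserves_order(history, edges):
    """Condition (ii): a -> b implies a occurs before b in the history."""
    pos = {e: k for k, e in enumerate(history)}
    return all(pos[a] < pos[b] for a, b in edges)

run_events = {"inc(1)", "inc(3)", "inc(5)", "read"}
run_edges = {("inc(1)", "inc(5)"), ("inc(1)", "inc(3)"),
             ("inc(5)", "read"), ("inc(3)", "read")}
h = lin(run_events, run_edges)
print(h, preserves_order(h, run_edges))
```

A different tie-breaking rule would yield a different but equally valid operator in the sense of the definition; the choice matters only once the specification is taken into account.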

We need to answer two questions.

1. For which kind of specifications will this CBCAST implementation be correct?
We answer this question in Section 4.2.1.

2. How general is this implementation? Perhaps there are other methods of
constructing a CBCAST implementation that cover a larger class of specifications.
We address this question in Section 4.2.2.

In order to answer these questions we need to translate the CBCAST implementation
from Figure 4.1 into our execution model.

Definition 4.7 

We use the following notation:

$R + e$: the run $R$ with the event $e$ added at the end, i.e., ordered after all
other events in $R$ ($\forall e' \in R + e: e' \rightarrow e$).

$R' < R$: $R'$ is a prefix of $R$, i.e.,
$R' \subseteq R \wedge \forall e, e' \in R: (e \rightarrow e' \wedge e' \in R') \Rightarrow e \in R'$.

$R[e]$: the run consisting of all events preceding $e$ in $R$, i.e.,
$R[e] = \{e' \in R \mid e' \rightarrow e\}$.³

$R \oplus_t e$: the run $R$ with the event $e$ added and ordered according to its
timestamp $t_e$, i.e., $\mathrm{event}_R(i,k) \rightarrow e \Leftrightarrow t_e[i] \ge k$.

The implementation outlined in Figure 4.1 contains a construct

"pick a value $v$, such that $\mathrm{LIN}(R + a{:}v) \in S$"

similar to the one used in Figure 3.4 (ABCAST implementation). In the ABCAST
case we used the fact that specifications are complete to argue that such a return
value will always exist. In the CBCAST implementation the argument no longer
holds, because, as defined in Definition 4.6, LIN may reorder events in $R + a{:}v$
differently for different values of $v$. Therefore we must place the following restriction
on LIN:

³ Recall that we defined "$\rightarrow$" to be reflexive. Therefore the event $e$ is always
contained in $R[e]$.

Definition 4.8 

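A linearization operator LIN is constructive under $\mathcal{S} = (n, I, V, S)$ iff
$\forall$ runs $R$ such that $\mathrm{LIN}(R) \in S$: $\forall$ invocations $a \in I$:

$\exists$ a return value $v \in V$: $\mathrm{LIN}(R + a{:}v) \in S$.

If LIN is constructive under $\mathcal{S}$ we can define an execution function
$\chi_{S,\mathrm{LIN}}: \mathcal{R} \times I \rightarrow V$ (similar to the one used in Section 3.5) with the following property:

$\forall R$ such that $\mathrm{LIN}(R) \in S$: $\forall a \in I$:
$v = \chi_{S,\mathrm{LIN}}(R, a) \Rightarrow \mathrm{LIN}(R + a{:}v) \in S$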
Given a specification $\mathcal{S} = (n, I, V, S)$ and a constructive linearization operator LIN,
we define the implementation $Y_{S,\mathrm{LIN}}$ as follows:

$Y_{S,\mathrm{LIN}} = (n, I, V, I \times T, (I \times V)^{\#} \times T, (\emptyset, [0,\ldots,1,\ldots,0]), \Phi, \Psi)$,
i.e., $M = I \times T$, $Q = (I \times V)^{\#} \times T$, $q_0 = (\emptyset, [0,\ldots,1,\ldots,0])$,

where $A^{\#}$ denotes the set of all partially ordered subsets of $A$, i.e., $(I \times V)^{\#}$ is the
set of all runs that can be constructed from events in $I \times V$. $T = \mathbb{N}^n$ is the set
of all timestamps (integer-valued vectors of length $n$). The transition functions $\phi_i$
and $\psi_i$ are defined as follows. When operation $a$ is invoked at processor $i$ in state
$[R, t]$, it does not change its state but broadcasts $[a{:}v, t, i]$:

$\phi_i([R,t], a) = ([R,t], [a{:}v, t, i])$, where $v = \chi_{S,\mathrm{LIN}}(R, a)$.

When processor $i$ receives a message $[a{:}v, s, j]$, it adds the event $a{:}v$ to its run $R$
and updates its timestamp vector by incrementing the $j$'th component; the value $v$
is returned to the client:

$\psi_i([R,t], [a{:}v, s, j]) = ([R \oplus_s a{:}v, t'], v)$,
where $t'[j] = t[j] + 1$, $t'[k] = t[k]$ for $k \ne j$.
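
The two transition functions can be mirrored by a small state machine. This is a rough sketch, not the dissertation's implementation: sites are numbered from 0, events are identified by the hypothetical pair (site, index), the run stores each event's timestamp so that order can be recovered by the timestamp comparison of Lemma 4.2, and `count_known` is only a placeholder for the execution function $\chi_{S,\mathrm{LIN}}$. Message transport and causal delivery are assumed to be provided by the caller.

```python
class Processor:
    """One replica with state [R, t]; invoke ~ phi_i, receive ~ psi_i."""

    def __init__(self, i, n, choose_value):
        self.i = i
        self.run = {}                     # (site, index) -> (operation, value, timestamp)
        self.t = [0] * n
        self.t[i] = 1                     # q0 = (empty run, [0, ..., 1, ..., 0])
        self.choose_value = choose_value  # stands in for chi_{S,LIN}

    def invoke(self, a):
        v = self.choose_value(self.run, a)
        return (a, v, tuple(self.t), self.i)   # message [a:v, t, i]; state unchanged

    def receive(self, msg):
        a, v, s, j = msg
        self.run[(j, s[j])] = (a, v, s)   # add the event, ordered by its timestamp s
        self.t[j] += 1
        return v                          # handed to the client when j == self.i

    def precedes(self, e1, e2):
        """e1 -> e2 iff index(e1) <= t_{e2}[site(e1)] (Lemma 4.2)."""
        site, index = e1
        return index <= self.run[e2][2][site]

def count_known(run, a):
    """Placeholder return value: the number of events known locally."""
    return len(run)

p0, p1 = Processor(0, 2, count_known), Processor(1, 2, count_known)
m0 = p0.invoke("a")
p0.receive(m0)            # immediate local delivery
p1.receive(m0)
m1 = p1.invoke("b")       # causally after "a"
p1.receive(m1)
p0.receive(m1)
print(p0.t, p1.t)
print(p0.precedes((0, 1), (1, 1)), p0.precedes((1, 1), (0, 1)))
```

Note that `invoke` never waits for remote input, which is the property that makes this style of implementation asynchronous.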

4.2.1 Correctness of CBCAST Implementation

The implementation we defined above has the property that every run $R_Y(E)$
generated by one of its executions satisfies a property that we call local correctness:

Definition 4.9

A run $R$ is locally correct under LIN and $S$ iff
$\forall$ events $e \in R$: $\mathrm{LIN}(R[e]) \in S$

This property can be interpreted as follows. Say $e = a{:}v$ is an event invoked at
processor $i$. In a run $R = R_{Y_{S,\mathrm{LIN}}}(E)$ of a CBCAST execution, the subrun $R[e]$
contains exactly those events that processor $i$ knows about at the time $a$ is invoked.
This is because $e' \in R[e]$ implies $e' \rightarrow e$; hence the CBCAST ordering guarantees that
the message about $e'$ is received before $a$ is invoked. $\mathrm{LIN}(R[e]) \in S$ then means
that processor $i$ executes the operation $a$ in a way that is correct with respect to
its local knowledge.

As stated in the next theorem, our CBCAST implementation $Y_{S,\mathrm{LIN}}$ will be
correct if the specification $\mathcal{S}$ has the property that local correctness always implies
global correctness (i.e., $\mathrm{LIN}(R) \in S$).


Figure 4.2: A locally correct run


Theorem 4.1 

If LIN is constructive and satisfies

$\forall$ runs $R$: $R$ locally correct $\Rightarrow \mathrm{LIN}(R) \in S$,

then $Y_{S,\mathrm{LIN}}$ is a correct CBCAST implementation of specification $\mathcal{S}$.

Before we present the proof it is useful to give an example of a specification that
does not satisfy this condition. Consider the problem of implementing a simple
counter. Clients can increment the counter by a specified amount (INC(x)) and
read the current value of the counter (READ). It is straightforward to write down
a specification for such a counter: in a legal history $H$ every READ must return the
sum of all increment values of INC operations preceding the READ in $H$. Figure 4.2
gives an example of a run $R$ containing INC and READ events (events are represented
by circles, the partial order by the arrows in the figure). This run satisfies local
correctness. Consider for example

$R[Read_2{:}9] = (Inc(1) \rightarrow Inc(5) \parallel Inc(3) \rightarrow Read{:}9)$

There are two ways of linearizing this subrun:

$\mathrm{LIN}(R[Read_2{:}9]) = (Inc_2(1), Inc_1(5), Inc_2(3), Read_2{:}9)$, or
$\mathrm{LIN}(R[Read_2{:}9]) = (Inc_2(1), Inc_2(3), Inc_1(5), Read_2{:}9)$,

which are both legal histories. Similarly one can check that for every event $e$ in
this run, any linearization of $R[e]$ is legal. Therefore $R$ is locally correct no matter
how LIN is defined. But $R$ itself cannot be linearized into a legal history: if LIN
orders $Inc_1(5)$ before $Inc_2(3)$ then the event $Read_3{:}4$ will be illegal; if $Inc_2(3)$ is
ordered before $Inc_1(5)$ then $Read_1{:}6$ is illegal. Hence although $R$ is locally correct
it does not satisfy $\mathrm{LIN}(R) \in S$ (global correctness).
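
The counter example can be verified by brute force. The sketch below is an assumption-laden reconstruction: the edge set is inferred from the textual description of the run (each READ follows exactly the increments it has seen), processor indices are omitted, and the event names are hypothetical. It enumerates every order-preserving linearization, so the conclusion does not depend on any particular choice of LIN.

```python
from itertools import permutations

# name -> ("inc", amount) or ("read", returned value)
events = {
    "inc1": ("inc", 1), "inc5": ("inc", 5), "inc3": ("inc", 3),
    "read6": ("read", 6), "read4": ("read", 4), "read9": ("read", 9),
}
# Assumed partial order, reconstructed from the description in the text.
edges = {("inc1", "inc5"), ("inc1", "inc3"), ("inc5", "read6"),
         ("inc3", "read4"), ("inc5", "read9"), ("inc3", "read9")}

def prefix(e):
    """R[e]: all events preceding e (including e itself)."""
    seen, stack = {e}, [e]
    while stack:
        cur = stack.pop()
        for a, b in edges:
            if b == cur and a not in seen:
                seen.add(a)
                stack.append(a)
    return seen

def linearizations(subset):
    """All total orders of subset that respect the partial order."""
    inner = [(a, b) for a, b in edges if a in subset and b in subset]
    for perm in permutations(sorted(subset)):
        pos = {e: k for k, e in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in inner):
            yield perm

def legal(history):
    """Counter specification: each read returns the sum of earlier increments."""
    total = 0
    for e in history:
        kind, x = events[e]
        if kind == "inc":
            total += x
        elif x != total:
            return False
    return True

locally_correct = all(legal(h) for e in events for h in linearizations(prefix(e)))
globally_linearizable = any(legal(h) for h in linearizations(set(events)))
print(locally_correct, globally_linearizable)   # prints: True False
```

Every prefix has only legal linearizations, yet no linearization of the whole run is legal, which matches the argument in the text.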

We will give the proof of Theorem 4.1 after the following three lemmas.

Lemma 4.3

(i) $E' < E \Rightarrow R_Y(E') < R_Y(E)$

(ii) Let $e = a{:}v \in R_Y(E)$, where $a$ is the corresponding invocation event in $E$. Then
$R_Y(E)[e] = R_Y(E[a])$

Proof: Follows from Definition 4.2 and Lemma 4.1. □

The next lemma makes a statement about the state of a processor in $Y_{S,\mathrm{LIN}}$ at the
time when a processor completes a client request and returns a value to the client.

Lemma 4.4

Let $E$ be a CBCAST execution history and $R = R_{Y_{S,\mathrm{LIN}}}(E)$. Let $e = a{:}v =
\mathrm{event}_E(i,j)$ be an event in $R$ and let $E[i,l] = \mathrm{rcv}_E((i,j),i)$ be the corresponding
receive event in $E$. Then

$\mathrm{stat}_E[i,l] = [R[e], \mathrm{timestamp}(e)]$.

In other words, at the time a value is returned to client $i$ the state of $p_i$ correctly
reflects the timestamp of the event $e$ as well as the run $R[e]$ of all events preceding
$e$ under "$\rightarrow$".

Proof: Let $\mathrm{stat}_E[i,l] = [r,t]$.

(i) We will first show that $r$ and $R[e]$ contain the same set of events. Let $e' =
\mathrm{event}_E(i',j') \in r$. Then $e'$ was added to $r$ when $p_i$ received a message $m = [e', t', i']$
from processor $i'$. Therefore

$\mathrm{rcv}_E((i',j'),i) <_i E[i,l] = \mathrm{rcv}_E((i,j),i)$

Because of immediate local delivery (CBCAST axiom, Definition 4.5) this implies

$\mathrm{rcv}_E((i',j'),i) <_i \mathrm{inv}_E(i,j)$

Therefore $\mathrm{inv}_E(i',j') \rightarrow \mathrm{inv}_E(i,j)$. By Definition 4.3 $e' \rightarrow e$ in $R$; hence $e' \in R[e]$.
This shows $r \subseteq R[e]$.

Conversely consider $e' = \mathrm{event}_E(i',j') \in R[e]$; then $e' \rightarrow e$ in $R$. By Definition 4.3
$\mathrm{inv}_E(i',j') \rightarrow \mathrm{inv}_E(i,j)$. Because of causal ordering under the CBCAST axiom

$\mathrm{rcv}_E((i',j'),i) <_i \mathrm{rcv}_E((i,j),i)$.

In other words, $p_i$ receives the message $m = [e', t', i']$ from processor $i'$ before $E[i,l] =
\mathrm{rcv}_E((i,j),i)$. Hence the event $e'$ will have been added to $r$ by that time, i.e., $e' \in r$.
This shows that $R[e] \subseteq r$. We conclude that (as unordered sets of events) $r = R[e]$.

(ii) Next, we show $t = \mathrm{timestamp}(e)$. The vector component $t[j]$ is incremented
each time $p_i$ receives a message from $p_j$. Therefore

$t[j] = \|\{e' \in r \mid e' \text{ is an event invoked at } p_j\}\|$

Part (i) of the proof implies

$\{e' \in r \mid e' \text{ is an event invoked at } p_j\} = \{e' \in R[e] \mid e' \text{ is an event invoked at } p_j\}$

Therefore by Definition 4.4, $t = \mathrm{timestamp}(e)$.

(iii) Finally, we show that the partial orders in $r$ and $R[e]$ are identical. This follows
immediately from (i) and (ii), because in $r$ events are ordered by timestamp. □

Lemma 4.5 

If LIN is constructive and satisfies

$\forall$ runs $R$: $R$ locally correct $\Rightarrow \mathrm{LIN}(R) \in S$,

then for every CBCAST execution history $E$: $R_{Y_{S,\mathrm{LIN}}}(E)$ is locally correct.

Proof: Let $Y = Y_{S,\mathrm{LIN}}$. We proceed by induction on the number of events in $E$.
The base case, $E = \emptyset$, is trivially satisfied, because an empty run is always locally
correct.

Induction step: consider $E \ne \emptyset$. We have to show that $\mathrm{LIN}(R_Y(E)[e]) \in S$ for
every $e$ in $R_Y(E)$. Let $a = \mathrm{inv}_E(i,j)$ be an invocation event in $E$, and let
$e = a{:}v = \mathrm{event}_E(i,j)$.

Define $E' = E[a]$ and $E'' = E' - a$. By Lemma 4.3 $R[e] = R_Y(E') = R_Y(E'') + a{:}v$.
By induction hypothesis $R_Y(E'')$ is locally correct and therefore by assumption
$\mathrm{LIN}(R_Y(E'')) \in S$.

By Lemma 4.4 $R_Y(E'')$ is equal to the "$R$-part" of the state $\mathrm{stat}_E[i,j]$ of processor
$i$ at the invocation event $a$. The message it sends is determined by $\phi$:

$m = \phi_Y(R_Y(E''), a) = [a{:}v', t, i]$, where $v' = \chi_{S,\mathrm{LIN}}(R_Y(E''), a)$.

Therefore $v = v' = \chi_{S,\mathrm{LIN}}(R_Y(E''), a)$. From the definition of $\chi_{S,\mathrm{LIN}}$ it follows that
$\mathrm{LIN}(R[e]) = \mathrm{LIN}(R_Y(E'') + a{:}v) \in S$. □

We now have all the necessary tools to prove the theorem about the correctness of
the CBCAST implementation.

Proof of Theorem 4.1: We show that under the assumption of Theorem 4.1
(local correctness of $R$ implies $\mathrm{LIN}(R) \in S$), for every CBCAST execution history
$E$, the history $H = \mathrm{LIN}(R_Y(E))$ is in $S$ and satisfies the correctness and liveness
conditions of Definition 3.12.

(i) By Lemma 4.5 $R_Y(E)$ is locally correct and therefore by assumption
$H = \mathrm{LIN}(R_Y(E)) \in S$.

(ii) Correctness: We have to show that $H|_i = H[E,i]$ for all $i$.
$H|_i$ contains the same set of events as $H[E,i]$, because $H = \mathrm{LIN}(R_Y(E))$.
Furthermore, the order $<_i$ on $E_i$ is preserved in $H$, because $e <_i e'$ implies $e \rightarrow e'$, and
LIN preserves "$\rightarrow$".

(iii) Liveness: Let $\mathrm{rcv}_E((l,m),k) <_k \mathrm{inv}_E(l',m')$.
We have to show that $\mathrm{event}_E(l,m) < \mathrm{event}_E(l',m')$ in $H$. This follows from
the fact that LIN preserves "$\rightarrow$", because $\mathrm{rcv}_E((l,m),k) <_k \mathrm{inv}_E(l',m')$ implies
$\mathrm{event}_E(l,m) \rightarrow \mathrm{event}_E(l',m')$. □

4.2.2 Existence of CBCAST Implementation

Theorem 4.1 in the previous section gave a sufficient condition for a specification $\mathcal{S}$
to be implementable with CBCAST. In this section we will show that this condition
is not only sufficient but also necessary. This will show that our CBCAST
implementation really is the most general implementation based on CBCAST: every problem
that has a CBCAST solution is solvable with our implementation. In other words,
the goal of this section is to prove the following theorem.

Theorem 4.2

A specification $\mathcal{S}$ has a CBCAST implementation

$\Leftrightarrow$

$\exists$ constructive linearization operator LIN:

$\forall$ runs $R$: $R$ locally correct $\Rightarrow \mathrm{LIN}(R) \in S$.

The "if" ("$\Leftarrow$") direction is equivalent to Theorem 4.1, which we proved in the
previous section. So our task is the following: given some CBCAST implementation
$Y$ of $\mathcal{S}$, we have to derive a linearization operator that satisfies the conditions of
Theorem 4.2. We will do this as follows: given a run $R$, our linearization operator
will map this run to a history in two steps:

$R \rightarrow E \rightarrow H$

In the first step, we map a run $R$ to an execution history $E$ that has the same set of
invocations as $R$ and the same partial order as $R$. The second mapping is defined
in terms of the behavior of implementation $Y$. If $Y$ is correct then there must exist
a legal history $H$ that satisfies the correctness and liveness conditions with respect
to $E$ (Definition 3.12). We map $E$ to this history.

The first mapping is called $\Gamma$. We want to define this mapping in such a way
that it preserves the partial order "$\rightarrow$" on $R$. Given $R$ we get a partially ordered
set of invocation events simply by ignoring the return values of the formal events in
$R$. The execution history $E = \Gamma(R)$ will have exactly these invocation events plus
all the corresponding receive events. Notice that in a CBCAST execution history
the relation "$\rightarrow$" between invocation events already determines the order of receive
events relative to invocation events in each processor history $E_i$:

$\mathrm{rcv}_E((k,l),i) <_i \mathrm{inv}_E(i,j) \Leftrightarrow \mathrm{inv}_E(k,l) \rightarrow \mathrm{inv}_E(i,j)$

The "$\Rightarrow$" direction follows from the definition of "$\rightarrow$", the other direction from the
CBCAST ordering axiom (Definition 4.5). Therefore, to completely determine $E =
\Gamma(R)$ we only need to specify the order of the receive events between two invocation
events. The CBCAST ordering axiom already determines a partial order on these
events; to define $E = \Gamma(R)$ we can pick any linearization of this partial order. A
topological sort will suffice. The next definition summarizes this procedure.

Definition 4.10 

The function $\Gamma: \mathcal{R} \rightarrow \mathcal{E}$ maps a run $R$ to an execution history $E$ in the
following way:

(i) $E_i = \{a \mid \exists j, v: a{:}v = \mathrm{event}_R(i,j) \in R\} \cup \{(k,l) \mid \exists k, l: \mathrm{event}_R(k,l) \in R\}$

(ii) $<_i$ is the topological sort of the partial order on the events
in $E_i$ induced by "$\rightarrow$" on $R$.

The next lemma formally states the properties of this mapping.

Lemma 4.6

Let $E = \Gamma(R)$. Then

(i) $E$ is a CBCAST execution history.

(ii) $\mathrm{inv}_E(i,j) \rightarrow \mathrm{inv}_E(i',j')$ in $E$ $\Leftrightarrow$ $\mathrm{event}_R(i,j) \rightarrow \mathrm{event}_R(i',j')$ in $R$.

(iii) $R' < R \Rightarrow \Gamma(R') < \Gamma(R)$.

Proof: (i) and (ii): As discussed above, we constructed $\Gamma$ in such a way as to
satisfy these two properties.

(iii) Follows from property (i) and Definition 4.10(ii). □

We now use $\Gamma$ to define a linearization operator LIN derived from an
implementation of $\mathcal{S}$.

Definition 4.11 

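Let $Y = (n, I, V, M, Q, q_0, \Phi, \Psi)$ be a CBCAST implementation of $\mathcal{S}$.

$\mathcal{H}(R) = \{H \in S \mid H \text{ is a linearization of } R_Y(\Gamma(R))\}$

$\mathrm{LIN}_Y(R) = \begin{cases} \text{"smallest" } H \text{ in } \mathcal{H}(R) & \text{if } R = R_Y(\Gamma(R)) \\ \bot & \text{otherwise} \end{cases}$

Notice that this definition implicitly assumes that $\mathcal{H}(R)$ is non-empty whenever
$R = R_Y(\Gamma(R))$.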

Lemma 4.7 

If $Y$ is a correct implementation of $\mathcal{S}$ then $R = R_Y(\Gamma(R))$ implies $\mathcal{H}(R) \ne \emptyset$,
hence $\mathrm{LIN}_Y$ is well defined.

Proof: Let $E = \Gamma(R)$. If $Y$ is correct then there exists a history $H \in S$ that satisfies
the correctness and liveness conditions of Definition 3.12. Correctness and liveness
imply that $H$ is a linearization of $R_Y(E)$. Therefore $H \in \mathcal{H}(R)$ if $R = R_Y(E)$. □

We now proceed to show that if $Y$ is a correct CBCAST implementation of $\mathcal{S}$ then
$\mathrm{LIN}_Y$ indeed satisfies the conditions of Theorem 4.2.

Lemma 4.8 


$\mathrm{LIN}_Y$ is constructive.

Proof: Let $R$ be a run such that $\mathrm{LIN}_Y(R) \in S$. We have to show that for every
invocation $a \in I$ there exists a return value $v$ such that $\mathrm{LIN}_Y(R + a{:}v) \in S$.

Let $E = \Gamma(R)$. If $\mathrm{LIN}_Y(R) \ne \bot$, Definition 4.11 implies $R = R_Y(E)$. Consider an
execution $E'$ that is identical to $E$, except that it has one more invocation event $a$
at the end, i.e.,

$E'_i = E_i + a + (i, j+1)$ where $j$ is the number of invocation events in $E_i$
$E'_k = E_k + (i, j+1)$ for $k \ne i$

Let $R' = R_Y(E')$. Then $R' = R + a{:}v$, where $v = \mathrm{val}_{E'}(i, j+1)$. By construction
$E' = \Gamma(R')$. Therefore, by Definition 4.11 $\mathrm{LIN}_Y(R + a{:}v) = \mathrm{LIN}_Y(R') \in S$. □

Lemma 4.9 

$\forall$ runs $R$: $R$ locally correct $\Rightarrow \mathrm{LIN}_Y(R) \in S$.

Proof: Induction on the number of events in $R$. The base case, $R = \emptyset$, is trivially
satisfied, because an empty history is always legal.

For the induction step consider $R \ne \emptyset$. We have to show that $\mathrm{LIN}_Y(R) \in S$,
which, by Definition 4.11, is equivalent to $R = R_Y(E)$, where $E = \Gamma(R)$. Let
$a{:}v = \mathrm{event}_R(i,j) \in R$ be an event in $R$ which corresponds to the invocation event
$a = \mathrm{inv}_E(i,j)$ in $E$. To prove that $R = R_Y(E)$ we have to show that $\mathrm{val}_E(i,j) = v$.

Let $e$ be a maximal event in $R$, and define $R' = R[e]$ and $R'' = R - \{e\}$. Local
correctness of $R$ implies that $\mathrm{LIN}_Y(R') \in S$ and that $R''$ is locally correct. By
induction hypothesis $\mathrm{LIN}_Y(R'') \in S$. Therefore, by Definition 4.11, $R' = R_Y(E')$
and $R'' = R_Y(E'')$, where $E' = \Gamma(R')$ and $E'' = \Gamma(R'')$.

Case 1, $a{:}v = e$: In this case $a{:}v \in R'$. Because $R' < R$ we have $E' < E$ (Lemma 4.6).
By Lemma 4.1, $\mathrm{val}_E(i,j) = \mathrm{val}_{E'}(i,j) = v$.

Case 2, $a{:}v \ne e$: In this case $a{:}v \in R''$. Because $R'' < R$ we have $E'' < E$
(Lemma 4.6). Again, by Lemma 4.1, $\mathrm{val}_E(i,j) = \mathrm{val}_{E''}(i,j) = v$. □

Proof of Theorem 4.2: The "$\Leftarrow$" direction is equivalent to Theorem 4.1. The
"$\Rightarrow$" direction follows from Lemma 4.8 and Lemma 4.9. □

4.3 BCAST and FBCAST Implementation

In this section we consider the problem of constructing implementations based on
unordered or FIFO broadcasts (BCAST and FBCAST). We start by investigating
BCAST implementations.

The CBCAST protocol presented in [BJ87b] implements causal ordering on top
of unordered message channels by a method called "piggybacking". Every broadcast
message is augmented by previous messages it might depend on before the message
is sent out. This way causal ordering can be achieved without multiple phases of
message exchanges. We use this idea to translate our CBCAST implementation from
Figure 4.1 into an equivalent BCAST implementation of the same specification. As
in the CBCAST implementation every processor keeps track of all events and their
partial order. When client $i$ invokes an operation $a$, processor $i$ not only broadcasts
the event $e = a{:}v$, but also the whole set of events that precede $e$ under "$\rightarrow$". An
informal description of the BCAST implementation is given in Figure 4.3. In the
rest of this section we will translate this implementation into our formal execution
model and show that it is correct under exactly the same conditions for $\mathcal{S}$ and LIN
as the CBCAST implementation.

Processor i runs the following program:

    R := empty;

    loop
        wait for an invocation by the local client or the receipt of a broadcast;
        if client invoked operation a then
            pick a value v, such that LIN(R + a:v) ∈ S;
            R := R + a:v;
            BCAST (R) to all processors;
            return v to the client;
        else if broadcast (R') was received then
            R := R ∪ R';
        end if
    end loop


Figure 4.3: BCAST implementation

Given a specification $\mathcal{S} = (n, I, V, S)$ and a constructive linearization operator
LIN, we define the implementation $Y_{S,\mathrm{LIN}}$ as follows:

$Y_{S,\mathrm{LIN}} = (n, I, V, (I \times V)^{\#} \times V, (I \times V)^{\#}, \emptyset, \Phi, \Psi)$,
i.e., $M = (I \times V)^{\#} \times V$, $Q = (I \times V)^{\#}$, $q_0 = \emptyset$,

where $(I \times V)^{\#}$ denotes the set of all runs that can be constructed from events in
$I \times V$. The transition functions $\phi_i$ and $\psi_i$ are defined as follows. When operation $a$
is invoked at processor $i$ in state $R$, it adds the event $a{:}v$ to its run $R$, and broadcasts
$[R + a{:}v, v]$:

$\phi_i(R, a) = (R + a{:}v, [R + a{:}v, v])$, where $v = \chi_{S,\mathrm{LIN}}(R, a)$

When processor $i$ receives a message $[R', v]$ it adds all events in $R'$ to its run $R$,
and the value $v$ is returned to the client:

$\psi_i(R, [R', v]) = (R \cup R', v)$
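
The piggybacking scheme can be sketched as a replica whose whole state is its run and whose every broadcast carries that run. This is an illustrative sketch under assumptions: sites are numbered from 0, events are identified by the hypothetical pair (site, index), a run is stored as a dictionary of events plus a set of ordered pairs, and `count_known` is only a placeholder for $\chi_{S,\mathrm{LIN}}$. Messages may arrive in any order.

```python
class BcastProcessor:
    """Replica whose state is a run R = (events, edges)."""

    def __init__(self, i, choose_value):
        self.i = i
        self.count = 0
        self.events = {}      # (site, index) -> (operation, value)
        self.edges = set()    # (earlier, later) pairs of the partial order
        self.choose_value = choose_value

    def invoke(self, a):
        v = self.choose_value(self.events, a)
        self.count += 1
        e = (self.i, self.count)
        # R + a:v: the new event is ordered after every event already known.
        self.edges |= {(old, e) for old in self.events}
        self.events[e] = (a, v)
        # The message is a snapshot of the whole run.
        return v, (dict(self.events), set(self.edges))

    def receive(self, msg):
        events, edges = msg
        self.events.update(events)    # R := R ∪ R'
        self.edges |= edges

def count_known(events, a):
    """Placeholder return value: the number of events known locally."""
    return len(events)

p0, p1 = BcastProcessor(0, count_known), BcastProcessor(1, count_known)
_, m0 = p0.invoke("a")
_, m1 = p1.invoke("b")        # concurrent with "a"
p1.receive(m0)
v2, m2 = p1.invoke("c")       # follows both "a" and "b"
p0.receive(m2)                # arrives before m1 on an unordered channel
p0.receive(m1)                # nothing new: m2 already carried "b"
print(sorted(p0.events), sorted(p0.edges))
```

Because each message carries its own causal history, the late arrival of `m1` at `p0` changes nothing, which is why no ordering guarantee is needed from the broadcast itself. The price is message size, which grows with the run.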

The next lemma makes a statement about the state of a processor in $Y_{S,\mathrm{LIN}}$
at the time when a processor completes a client request (i.e., returns a value to the
client).

Lemma 4.10 

Let $E$ be a BCAST execution history and $R = R_{Y_{S,\mathrm{LIN}}}(E)$. Let $e = a{:}v =
\mathrm{event}_E(i,j)$ be an event in $R$ and let $E[i,l] = \mathrm{inv}_E(i,j)$ be the corresponding
invocation event in $E$. Then

$\mathrm{stat}_E[i,l] = R[e]$

Proof: Let $\mathrm{stat}_E[i,l] = r$. Proof by induction on the number of events in $R[e]$.

Base case: $R[e] = \{e\}$. Then $a = \mathrm{inv}_E(i,j)$ is the first invocation event in $E_i$.
Furthermore, there can be no receive events preceding $a$ in $E_i$; otherwise the
corresponding formal events would be in $R[e]$. Therefore the state of $p_i$ before the event
$e$ is $q_0 = \emptyset$. Hence $r = \emptyset + e = \{e\}$.

Induction step: consider $r$ with more than one event. Let $f \in r$. Then $f$ was
added to $r$ when $p_i$ received a message $m = r'$ with $f \in r'$ from processor $i'$. Let
$e' = \mathrm{event}_E(i',j')$ be the invocation that caused $p_{i'}$ to send this message. Then
$e' \rightarrow e$, because $\mathrm{rcv}_E((i',j'),i) <_i \mathrm{inv}_E(i,j)$. Furthermore, by induction hypothesis
$r' = R[e']$; hence $f \rightarrow e'$. By transitivity $f \rightarrow e$. Therefore $f \in R[e]$. This shows
$r \subseteq R[e]$.

Let $e' = \mathrm{event}_E(i',j') \in R[e]$, i.e., $e' \rightarrow e$. Then $\mathrm{inv}_E(i',j') \rightarrow \mathrm{inv}_E(i,j)$. If $i' = i$
then $e'$ was added to $r$ at the invocation $\mathrm{inv}_E(i,j')$; hence $e' \in r$. Otherwise, by
definition of "$\rightarrow$" (Definition 4.1) there must be a receive event $\mathrm{rcv}_E((i'',j''),i) \in E_i$ such that

$\mathrm{inv}_E(i',j') \rightarrow \mathrm{inv}_E(i'',j'')$ and $\mathrm{rcv}_E((i'',j''),i) <_i \mathrm{inv}_E(i,j)$.

Then $e' \rightarrow e'' = \mathrm{event}_E(i'',j'')$. By induction hypothesis the state $r''$ of $p_{i''}$ after
the invocation event $\mathrm{inv}_E(i'',j'')$ is equal to $R[e'']$. Therefore $e' \in r'' = \mathrm{msg}_E(i'',j'')$.
Therefore, when $p_i$ receives the message from $p_{i''}$ it contains $e'$, which will then be
added to $r$. This shows $R[e] \subseteq r$. We conclude that $r = R[e]$. □

This lemma is the equivalent of Lemma 4.4 for CBCAST implementations.

Theorem 4.3

If LIN is constructive and satisfies

$\forall$ runs $R$: $R$ locally correct $\Rightarrow \mathrm{LIN}(R) \in S$,

then $Y_{S,\mathrm{LIN}}$ is a correct BCAST implementation of specification $\mathcal{S}$.

Proof: The proof is the same as for Theorem 4.1 with references to Lemma 4.4
replaced by Lemma 4.10. □

Theorem 4.4 

A specification $\mathcal{S}$ has a BCAST implementation

$\Leftrightarrow$

$\exists$ constructive linearization operator LIN:

$\forall$ runs $R$: $R$ locally correct $\Rightarrow \mathrm{LIN}(R) \in S$.

Proof: The "$\Leftarrow$" direction is equivalent to the previous theorem (4.3). The "$\Rightarrow$"
direction follows from Theorem 4.2, because every BCAST implementation for $\mathcal{S}$ is
also a CBCAST implementation. □

Corollary 4.2 

A specification $\mathcal{S}$ has an FBCAST implementation

$\Leftrightarrow$

$\exists$ constructive linearization operator LIN:

$\forall$ runs $R$: $R$ locally correct $\Rightarrow \mathrm{LIN}(R) \in S$.

Proof: The "$\Leftarrow$" direction follows from Theorem 4.4, because every BCAST
implementation for $\mathcal{S}$ is also an FBCAST implementation. The "$\Rightarrow$" direction follows
from Theorem 4.2, because every FBCAST implementation for $\mathcal{S}$ is also a CBCAST
implementation. □


4.4 Summary

In this chapter we looked at the problem of constructing an implementation for a 
formal specification using broadcast protocols that are more efficient them abcast. 
Let <5 be the class of all formal specification and let Szbcast the the subset of S 
containing all specifications that have an XBCAST implementation (where xbcast 
stands for ABCAST, CBCAST, . . . ). We have shown that <5 separates into two 
distinct subclasses: 

— S abcast > $ cbcast ~ & fbcast ~ & beast' 

We call the second class Sasync because specifications in this class have implemen- 
tations with the following characteristic. When a client invokes an operation it is 
always possible to compute the return value immediately from local information. 
This way the client never has to wait for replies from remote sites; information is 
propagated asynchronously in the background. 

We showed that a specification 5 is a member of the class Sasync if and only 
if there exists a linearization operator for 5 (Theorem 4.2). This linearization 
operator can be used to automatically construct an implementation for 5. In the 
next chapter we will look at the problem of finding such an operator. 



Chapter 5 


Commutative Specifications 


In the previous chapter we gave a complete characterization for the class of
specifications that have an asynchronous implementation. Unfortunately, as we will show in
the next section, this class is non-recursive, i.e., in general the question of whether
a specification has an asynchronous implementation is undecidable. This result
shows that we cannot find a general algorithm that would automatically construct
a suitable linearization operator from a given specification. Therefore, we have to
investigate methods that could be applied to certain "simple" subclasses of
specifications. In this chapter we explore the possibility of exploiting knowledge about
the commutativity of operations in a specification in order to construct linearization
operators.

5.1 Undecidability

Theorem 4.2 reduces the problem of constructing an asynchronous implementation
for a specification S to the problem of finding a linearization operator LIN that
satisfies the condition of Theorem 4.2. Unfortunately this problem is still very hard.
In fact, the example below shows that the general problem of deciding whether a
specification S has an asynchronous implementation is undecidable.

Consider a system with two processors in which client 1 may invoke a parameterless
operation a; client 2 may invoke an operation b with one integer parameter.
Define the following class of specifications:

    S_i = (2, I, V, S_i), where

    I   = {a_1} ∪ {b_2(x) | x ∈ N}
    V   = {0, 1}
    S_i = {(a_1:0)}
          ∪ {(b_2(x):0) | x ∈ N}
          ∪ {(a_1:0, b_2(x):0) | x ∉ h_i}
          ∪ {(a_1:0, b_2(x):1) | x ∈ h_i}
          ∪ {(b_2(x):0, a_1:1) | x ∈ N}

    where

    h_i = {x | x is an encoding of a computation of the
           i'th Turing machine, T_i, in which T_i halts}

Lemma 5.1

S_i has an asynchronous implementation iff the Turing machine T_i never halts.

Proof: Let LIN be a linearization operator that satisfies the condition of
Theorem 4.2 for S_i. Consider the run

    R = a_1:0 // b_2(x):0

Table 5.1: Linearization operator for S_i

    R                      LIN(R)
    ---------------------  --------------------  ---------------------
    a_1:0                  (a_1:0)
    a_1:1                  ⊥
    b_2(x):0               (b_2(x):0)            for all x ∈ N
    b_2(x):1               ⊥                     for all x ∈ N
    a_1:0 // b_2(x):0      (a_1:0, b_2(x):0)     for all x ∈ N
    a_1:u // b_2(x):v      ⊥                     if u ≠ 0 or v ≠ 0
    a_1:0 → b_2(x):0       (a_1:0, b_2(x):0)     for all x ∈ N
    a_1:u → b_2(x):v       ⊥                     if u ≠ 0 or v ≠ 0
    b_2(x):0 → a_1:1       (b_2(x):0, a_1:1)     for all x ∈ N
    b_2(x):u → a_1:v       ⊥                     if u ≠ 0 or v ≠ 1
    all other R            ⊥

for some x ∈ N. If LIN is constructive then LIN(∅) = ∅ ∈ S_i implies that there
exists a return value v such that LIN(a_1:v) ∈ S_i. The way S_i is defined this is
only possible if v = 0. Hence LIN(a_1:0) = (a_1:0) ∈ S_i, and by the same argument
LIN(b_2(x):0) = (b_2(x):0) ∈ S_i. Therefore the run R satisfies local correctness. By
Theorem 4.2 LIN(R) ∈ S_i. This is only possible if LIN(R) = (a_1:0, b_2(x):0), and
x ∉ h_i. But R was locally correct for any x. Hence h_i = ∅, i.e., T_i never halts.

Conversely, assume that T_i never halts. Define a linearization operator LIN by
Table 5.1. The way S_i is defined, a legal history has at most one event at p_1 and one
event at p_2. Consequently a locally correct run can have at most two events. Therefore
our table enumerates all possible locally correct runs. It is straightforward to check
that if h_i = ∅ then for every row in the table, either the history in the right column
is legal, or the run in the left column violates local correctness. □

The lemma shows that, if we had a procedure for deciding if S_i is simple for a given
i, then we would have solved the halting problem for Turing machines. But since
the halting problem is undecidable we have:

Corollary 5.1

The problem of finding those i for which S_i is simple is undecidable.

Fortunately, hardly any problem that arises in real distributed systems has anything 
to do with Turing machine computations. In many cases the problem at hand can 
be solved despite the undecidability of the general case. 

5.2 Commutative Specifications 

The difference between an ABCAST execution history and a CBCAST execution
history is that in the CBCAST case different processors may observe events in different
orders. Therefore, it should be easier to construct a CBCAST implementation if
certain events commute, that is, if their order in a legal history can be reversed
without making the history illegal. We explore this idea in this section.

5.2.1 Commutativity and Ordering Constraints 

We start by defining an equivalence relation on histories. Two histories are
equivalent if no sequence of future events can distinguish them.

Definition 5.1

Two histories, H_1 and H_2, are equivalent (H_1 ≡ H_2) iff

    ∀H: H_1 H ∈ S ⇔ H_2 H ∈ S.

We can identify the equivalence classes of histories with the states the system can 
be in. Different histories in the same equivalence class represent different ways of 
reaching the same system state. We use this equivalence relation to distinguish 
between read-only events and update events. An event is a read-only event if it does 
not change the system state. 

Definition 5.2

An event e is read-only iff

    ∀H: He ∈ S ⇒ H ≡ He.

Events that are not read-only are called update events.

Note that whether a particular operation is read-only depends on the outcome (i.e.,
return value) of the operation. Consider, for instance, the PASS operation from
our token passing example. The event P_i(x):ok is an update whereas P_i(x):eH is a
read-only event.

We now turn our attention to specifications in which update events always com- 
mute. 

Definition 5.3

Specification S is commutative iff

∀H: ∀ a, b update events at different processors:

    Ha ∈ S ∧ Hb ∈ S ⇒ Hab ∈ S ∧ Hba ∈ S ∧ Hab ≡ Hba.

Clearly, read-only events always commute with each other, but read-only events
may not commute with update events. Consider Ha, Hb ∈ S, where a is
read-only and b is an update event. Then Hab ∈ S, otherwise a would not be
read-only. It could be that also Hba ∈ S; this would mean that the return value in
a is not affected by the update b. Otherwise, if Hba ∉ S, then a is affected by b,
i.e., the return value in a is no longer valid if a is ordered after b. We denote this
kind of dependency between two events by the symbol "↦".

Definition 5.4

a is invalidated by b (a ↦ b) iff

    ∃H: Ha ∈ S ∧ Hb ∈ S ∧ Hba ∉ S.

If e_1 ↦ e_2 or e_2 ↦ e_1 we also say that there is an ordering constraint between the
two events. From the above discussion it is clear that

Lemma 5.2

If S is commutative then

    a ↦ b ⇒ a is read-only and b is an update event.
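For a small specification, the invalidation relation can be explored by brute force. The sketch below is an illustration only: it uses a simple counter (increments and reads) rather than a specification taken from this chapter, ignores which processor issues an event, and the names `legal`, `ALPHABET`, and `invalidated_by` are hypothetical. Because the search over histories H is bounded, a result of `True` exhibits a witness, while `False` only means that no witness was found within the bound.

```python
from itertools import product
from typing import List, Sequence, Tuple

Event = Tuple[str, int]  # ("inc", amount) or ("read", returned value)


def legal(history: Sequence[Event]) -> bool:
    """A counter history is legal iff every read returns the sum of all
    increments that precede it."""
    total = 0
    for kind, value in history:
        if kind == "inc":
            total += value
        elif value != total:
            return False
    return True


# A small event alphabet from which candidate histories H are built.
ALPHABET: List[Event] = [("inc", 1), ("read", 0), ("read", 1), ("read", 2)]


def invalidated_by(a: Event, b: Event, max_len: int = 2) -> bool:
    """Bounded search for a history H such that Ha and Hb are legal
    but Hba is not, i.e. a witness for a being invalidated by b."""
    for n in range(max_len + 1):
        for prefix in product(ALPHABET, repeat=n):
            h = list(prefix)
            if legal(h + [a]) and legal(h + [b]) and not legal(h + [b, a]):
                return True
    return False
```

With this model, `invalidated_by(("read", 0), ("inc", 1))` finds a witness (the empty history), whereas the search finds none with the roles reversed, in line with the lemma: the invalidated event is the read-only one and the invalidating event is the update.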

As an example let us consider the various events possible in our token
passing service. The read-only events are

    Q_i:T, Q_i:F, R_i:eH, R_i:eR, P_i(j):eH, P_i(j):eR,

whereas the following are update events:

    R_i:ok, P_i(j):ok.

Note that in the traditional sense, PASS and REQUEST operations do not commute.
For example, the history

    H = ( R_3:ok, P_1(3):ok )

would not be legal if we reversed the order of the two events. If the PASS operation
is invoked before the REQUEST operation it should return ErrorRequest instead
of OK. However, according to our definition the token passing specification is
commutative. This is because we require two update events a, b to commute only if
both events are legal independent of each other (Ha ∈ S and Hb ∈ S for some H).
Hence the fact that the two events R_3:ok and P_1(3):ok do not commute does not
affect the commutativity of the specification, because the second event (P_1(3):ok)
is never legal without the first. Formally:

    ¬∃ H ∈ S: H + R_3:ok ∈ S ∧ H + P_1(3):ok ∈ S

A complete analysis of the token specification shows that any two update events
either commute or have the property that one is never legal without the other (see
Appendix A). Hence the token passing specification is commutative.
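The REQUEST/PASS example can be replayed with a small legality checker. This is a sketch under stated assumptions: the token passing specification itself is defined earlier in the dissertation and is not reproduced here, so the semantics below are reconstructed from remarks in this chapter (processor 1 holds the token initially; eH on PASS means the caller is not the holder, eH on REQUEST means the caller already is the holder; eR on PASS means the target has not requested the token, eR on REQUEST means a request is already pending). The function `legal` and the tuple encoding of events are hypothetical.

```python
from typing import Optional, Sequence, Set, Tuple

# (operation, caller, argument, return value); argument is None for Q and R.
Event = Tuple[str, int, Optional[int], str]


def legal(history: Sequence[Event], initial_holder: int = 1) -> bool:
    """Replay a history and compare every return value with the one the
    assumed token passing semantics would produce."""
    holder = initial_holder
    requested: Set[int] = set()
    for op, caller, arg, ret in history:
        if op == "Q":  # QUERY: does the caller hold the token?
            expected = "T" if holder == caller else "F"
        elif op == "R":  # REQUEST the token
            if holder == caller:
                expected = "eH"
            elif caller in requested:
                expected = "eR"
            else:
                expected = "ok"
                requested.add(caller)
        elif op == "P":  # PASS the token to processor `arg`
            if holder != caller:
                expected = "eH"
            elif arg not in requested:
                expected = "eR"
            else:
                expected = "ok"
                holder = arg
                requested.discard(arg)
        else:
            return False
        if ret != expected:
            return False
    return True


request3: Event = ("R", 3, None, "ok")   # R_3:ok
pass13: Event = ("P", 1, 3, "ok")        # P_1(3):ok
```

Under these assumptions `legal([request3, pass13])` holds, the reversed order is illegal, and a PASS issued before the request is legal only with the return value eR.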

The intuitive reason for defining commutative specifications this way is the
following. If there are two updates a, b of this type (b is not legal without a) then
these two events will not occur concurrently in an execution. The two events will
always be related by information flow (a → b). Because CBCAST preserves "→", all
processors will observe a before b. Therefore it does not matter whether the two
events commute.

In the token passing specification there are two types of ordering constraints:
the first one between certain QUERY and PASS events

    Q_i:F ↦ P_j(i):ok,

the second one between an unsuccessful PASS event and a REQUEST event:

    P_i(j):eR ↦ R_j:ok

A complete table of dependencies for token passing events is given in the appendix.

5.2.2 Applying Commutativity to Runs 

How do the concepts discussed in the previous section help us construct a
linearization operator for a commutative specification? Our plan is the following:

1. We assume that we can compute the ordering constraints "↦" between any
two events. Given a run R, we construct what we call the closure of
R (written R̄) by adding extra edges to R: For all events a, b ∈ R that are concurrent
in R, we add an edge a → b if the ordering constraint a ↦ b holds.

2. Provided that R̄ has no cycles, we define LIN(R) by arbitrarily picking a
linearization H of R̄. That is, we pick a history H that contains the same
events as R and has a total order that is consistent with "→" and "↦".
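The two steps of this plan can be sketched as follows. This is a minimal illustration rather than the construction used in the proofs: events are plain strings, the information-flow relation is given as a set of edges, and the ordering constraints are supplied as a predicate; `reachable`, `closure`, and `linearize` are hypothetical helper names. The linearization is obtained with the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later), which picks some order consistent with the edges and raises `CycleError` when the closure has a cycle.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, List, Optional, Sequence, Set, Tuple

Edge = Tuple[str, str]


def reachable(events: Sequence[str], flow: Set[Edge]) -> Set[Edge]:
    """Transitive closure of the information-flow edges."""
    reach = set(flow)
    for k in events:
        for a in events:
            for b in events:
                if (a, k) in reach and (k, b) in reach:
                    reach.add((a, b))
    return reach


def closure(events: Sequence[str], flow: Set[Edge],
            invalidated_by: Callable[[str, str], bool]) -> Set[Edge]:
    """Step 1: add an edge a -> b for concurrent a, b with a invalidated by b."""
    reach = reachable(events, flow)
    extra = {
        (a, b)
        for a in events for b in events
        if a != b and (a, b) not in reach and (b, a) not in reach
        and invalidated_by(a, b)
    }
    return set(flow) | extra


def linearize(events: Sequence[str], edges: Set[Edge]) -> Optional[List[str]]:
    """Step 2: pick a total order consistent with the edges, or None on a cycle."""
    preds = {e: set() for e in events}
    for a, b in edges:
        preds[b].add(a)
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None


# Example run (assumed for illustration): processor 2 requests the token and
# then queries it, while processor 1 passes the token to processor 2.
events = ["R2:ok", "Q2:F", "P1(2):ok"]
flow = {("R2:ok", "Q2:F"), ("R2:ok", "P1(2):ok")}
constraint = lambda a, b: (a, b) == ("Q2:F", "P1(2):ok")
history = linearize(events, closure(events, flow, constraint))
```

Here the query and the pass are concurrent, so the closure adds the edge from `Q2:F` to `P1(2):ok` and the only linearization places the query before the pass.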

Figure 5.1 shows an example of applying this method to a run of the token passing
service. It shows a run R and its closure. R is represented by the circles (events) and
solid arrows (information flow relation between those events). R̄ is given by R plus
the dashed arrows (ordering constraints). We can get a legal history for this run by
ordering all its events in such a way that the partial order given by the solid and
dashed arrows is preserved. The history H given below the diagram shows one
possible linearization. We formalize this method below and show that the linearization
operator defined this way works in the case of commutative specifications.

    H = ( R_2:ok, Q_3:F, P_1(2):ok, R_3:ok, Q_1:F, P_2(3):ok, Q_3:T )

Figure 5.1: An example run and one of its linearizations


Definition 5.5

The closure R̄ of a run R is the run R augmented by edges between any two
concurrent events a and b, whenever a ↦ b, or formally:

    a → b ∈ R̄ ⇔ (a → b ∈ R ∨ (a // b ∈ R ∧ a ↦ b))

Definition 5.6

    ℋ(R) = {H ∈ S | H is a linearization of R̄}

    LIN_S(R) = the "smallest" H in ℋ(R)    if ℋ(R) ≠ ∅
               ⊥                            otherwise

Recall from Theorem 4.1 that, to show that the CBCAST implementation will be
correct with this linearization operator, we have to prove

    LIN_S(R) ∈ S for every locally correct run R,

where local correctness means LIN(R[a]) ∈ S for every a ∈ R. As defined above,
LIN_S simply picks one possible linearization of R̄ to map a run R to a history.
Hence in this case local correctness of R implies that every R[a] (for a in R) has a
legal linearization. We call such a run weakly plausible:

Definition 5.7

R is weakly plausible ⇔ ∀a ∈ R: ∃ legal linearization of R[a].

If not just one, but all linearizations of R[a] are legal, then we call this run strongly
plausible:

Definition 5.8

R is strongly plausible iff

    ∀a ∈ R: ∃ legal linearization of R[a]
            ∧ every linearization of R[a] is legal.

The relationship between local correctness under LIN_S and strong and weak
plausibility is the following: Strong plausibility implies local correctness, and local
correctness implies weak plausibility. We will show (Lemma 5.4 below) that for
commutative specifications these two forms of plausibility are in fact equivalent. Hence
a run is locally correct if and only if it is plausible (strong or weak). Therefore
we only need to show that LIN_S(R) ∈ S for strongly plausible runs R. The next
lemma will allow us to do this.

Lemma 5.3

If R is strongly plausible then every linearization of R̄ is legal.

Proof: Induction on the number of events in R: Trivially satisfied for empty runs,
because empty histories are always legal. Now assume R is non-empty:

Case 1: If R has a unique maximal element a then R = R[a], and our claim follows
from Definition 5.8.

Case 2: Let H be an arbitrary linearization of R̄, let a be the last event in H,
and let b ≠ a be a maximal element of R. H can be written as H = H'bH''a. The
history H_1 = H'bH'' is a linearization of R − a. By induction hypothesis H_1 is legal.
Similarly, H_2 = H'H''a is legal as a linearization of R − b. Let H''' be equal to H'',
except that all read-only events are removed from H'''. If b is a read-only event
then

    H = H'bH''a ≡ H'H''a = H_2 ∈ S

and we are done. Otherwise, b commutes with every event in H'''; hence

    H'H'''b ≡ H'bH''' ≡ H'bH'' = H_1 ∈ S, and
    H'H'''a ≡ H'H''a = H_2 ∈ S.

Then H'H'''ba ∈ S, because otherwise there would be an ordering constraint a ↦ b,
but then a could not be the last event in a linearization of R̄. Therefore

    H = H'bH''a ≡ H'bH'''a ≡ H'H'''ba ∈ S.

□

We can now prove that for commutative specifications weak and strong plausibility
are equivalent.

Lemma 5.4

If S is commutative then

(i) Weak and strong plausibility are equivalent.

(ii) All linearizations of a plausible run are equivalent.

Proof: (i) We have to show that every weakly plausible run R is also strongly
plausible. Induction on the number of events in R: Trivially satisfied for empty
runs, because empty histories are always legal.

Now consider a non-empty, weakly plausible run R. Assume R is not strongly
plausible. Then there must be a left subrun R[a] with a legal linearization, say Ha,
such that some other linearization H'a is not legal. These two histories only differ
in the order of events that are concurrent in R[a]. We may transform one into the
other by swapping adjacent concurrent events. Thus we get a sequence of histories

    H_1 a, H_2 a, H_3 a, ..., H_n a, where H_1 = H and H_n = H',

in which H_i and H_{i+1} differ only in the order of two adjacent events. If H'a ∉ S
then the sequence must contain two consecutive histories,

    H_i = A b_1 b_2 B,    H_{i+1} = A b_2 b_1 B

such that H_i a is legal but H_{i+1} a is not. Note that H_i and H_{i+1} are linearizations
of R' = R[a] − a. By induction hypothesis R' is strongly plausible. By Lemma 5.3
H_i and H_{i+1} are both legal. Because specifications are prefix-closed, A b_1 b_2 and
A b_2 b_1 must also be legal. If S is commutative then these last two histories are
equivalent, and hence H_i a and H_{i+1} a should either both be legal or both be illegal.
This contradicts our earlier assumption.

(ii) Let H and H' be two linearizations of a plausible run R. We have to show that
H' ≡ H. H and H' differ in the order of events that are concurrent in R̄. Again, we
transform one into the other by swapping adjacent concurrent events, leading to a
sequence of histories:

    H_1, H_2, H_3, ..., H_n, where H_1 = H and H_n = H'.

We can write H_i and H_{i+1} as

    H_i = A b_1 b_2 B,    H_{i+1} = A b_2 b_1 B.

From part (i) we know that R is strongly plausible; hence, by Lemma 5.3, H_i and
H_{i+1} are both legal. Therefore their prefixes A b_1 and A b_2 are legal. If one of the
two events (say b_1) is a read-only event then

    A b_1 b_2 ≡ A b_2 ≡ A b_2 b_1.

If both events are updates then they must commute. In any case, we have
A b_1 b_2 ≡ A b_2 b_1 and therefore H_i ≡ H_{i+1}. By transitivity H ≡ H'. □

This lemma now allows us to show that the linearization operator we introduced 
in this section (Definition 5.6) can be used to construct asynchronous implementa- 
tions. 

Definition 5.9

A commutative specification S is acyclic iff

    ∀R: R plausible ⇒ R̄ acyclic.

Otherwise we say S is cyclic.

Theorem 5.1

If S is commutative and acyclic then the CBCAST implementation with LIN_S
as its linearization operator is correct.

Proof: We will show that if every plausible run has an acyclic closure then LIN_S
is constructive and satisfies LIN_S(R) ∈ S for every locally correct run R. The claim
then follows from Theorem 4.2.

(i) LIN_S is constructive: Let R ∈ S, a ∈ I. We have to show that there is a return
value v ∈ V such that LIN_S(R + a:v) ∈ S. LIN_S is defined in such a way that the
order of events in H = LIN_S(R + a:v) is independent of the choice for the return
value v, that is

    LIN_S(R + a:v) = LIN_S(R) + a:v, for all v such that LIN_S(R) + a:v ∈ S

Because specifications are complete (Definition 3.3) LIN_S(R) ∈ S implies that such
a value always exists.

(ii) LIN_S(R) ∈ S for every locally correct R: Local correctness of R implies that
R is weakly plausible. By Lemma 5.4, R is strongly plausible. By Lemma 5.3,
every linearization of R̄ is legal. If R̄ is acyclic then such a linearization exists, and
ℋ(R) ≠ ∅. Hence LIN_S(R) ∈ S. □

5.2.3 Proving Acyclicity 

In Chapter 4 we gave an example of a specification for a simple counter that does
not have a CBCAST implementation. This specification is commutative: READ
operations are read-only and INC operations commute. However, the acyclicity
requirement in Theorem 5.1 is not satisfied, as the example in Figure 5.2 shows. The run
in this figure is plausible; for example

    R[Read_1:6] = Inc_2(1) → Inc_1(5) → Read_1:6

has only one linearization, and this linearization is legal. However the closure of R
has a cycle

    Inc_1(5) → Read_1:6 ↦ Inc_3(3) → Read_3:4 ↦ Inc_1(5).

As we have already seen in Section 4.2.1, this problem has no asynchronous
implementation.

    p_1:   Inc_1(5)   Read_1:6
    p_2:   Inc_2(1)
    p_3:   Inc_3(3)   Read_3:4

Figure 5.2: An example run
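The counter example can be checked by exhaustive enumeration. The sketch below is illustrative only: the information-flow edges of the run are an assumption (Inc_2(1) is taken to reach both p_1 and p_3 before they act, which matches the left subrun quoted above and the values returned by the two reads), and `legal` and `legal_linearizations` are hypothetical helpers. It shows that each left subrun ending in a read has a legal linearization, while no total order of all five events that respects the information flow is legal.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Event = Tuple[str, int, int]  # (kind, processor, amount or returned value)

INC1, READ1 = ("inc", 1, 5), ("read", 1, 6)
INC2 = ("inc", 2, 1)
INC3, READ3 = ("inc", 3, 3), ("read", 3, 4)

# Assumed information flow: Inc_2(1) reaches p_1 and p_3 before they act.
FLOW = {(INC2, INC1), (INC1, READ1), (INC2, INC3), (INC3, READ3)}


def legal(history: Sequence[Event]) -> bool:
    """Every read must return the sum of the increments that precede it."""
    total = 0
    for kind, _, value in history:
        if kind == "inc":
            total += value
        elif value != total:
            return False
    return True


def legal_linearizations(events: Sequence[Event]) -> List[List[Event]]:
    """All legal total orders of `events` that respect the FLOW edges."""
    result = []
    for order in permutations(events):
        pos = {e: i for i, e in enumerate(order)}
        respects = all(pos[a] < pos[b]
                       for a, b in FLOW if a in pos and b in pos)
        if respects and legal(order):
            result.append(list(order))
    return result


left_subrun_p1 = [INC2, INC1, READ1]      # R[Read_1:6]
left_subrun_p3 = [INC2, INC3, READ3]      # R[Read_3:4]
whole_run = [INC1, READ1, INC2, INC3, READ3]
```

The empty result for `whole_run` reflects the cycle in the closure: Read_1:6 must precede Inc_3(3) and Read_3:4 must precede Inc_1(5), which no total order can satisfy together with the information flow.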

In this section we will present techniques for deciding whether a specification 
is cyclic or not. We will illustrate our techniques by applying them to our token 
passing example. 

Definition 5.10

Let R be a run with a cycle C in R̄:

    C = e_{1,1} → e_{1,2} → ... → e_{1,n_1}
        ↦ e_{2,1} → e_{2,2} → ... → e_{2,n_2}
        ↦ ...
        ↦ e_{m,1} → e_{m,2} → ... → e_{m,n_m}
        ↦ e_{1,1}

We call

    e_{i,1} → e_{i,2} → ... → e_{i,n_i}

(for i = 1 ... m) a segment of the cycle.





Lemma 5.5

Every cycle in the closure R̄ of a run R has at least two segments.

Proof: Because R itself is acyclic, every cycle in R̄ must contain at least one "↦"
edge. Consider a cycle with only one segment:

    C = e_1 → e_2 → ... → e_n ↦ e_1

According to Definition 5.5, R̄ contains "↦" edges only between events that are
concurrent in R. Since e_1 → e_n (by transitivity), R̄ cannot contain the edge e_n ↦ e_1.
□

Lemma 5.6

If R̄ has a cycle then it also has a cycle in which

(i) all segments are concurrent, i.e., a // b for any two events a and
b in different segments.

(ii) every segment has at most two events.


Proof: (i) Let C, written in the form of Definition 5.10, be a cycle in R̄. Assume
C has two non-concurrent segments. Then there are two events a = e_{i,j} and
b = e_{k,l} such that a → b in R. We can use this relation to construct a smaller cycle

    C' = a → b → e_{k,l+1} → ... → e_{k,n_k}
         ↦ e_{k+1,1} → e_{k+1,2} → ...
         ...
         ↦ e_{i,1} → ... → e_{i,j-1} → a

The cycle C' has strictly fewer segments than C, because C' does not contain

    a → e_{i,j+1} → ... → e_{i,n_i}
      ↦ e_{i+1,1} → ... → e_{i+1,n_{i+1}}
      ...
      ↦ e_{k,1} → ... → e_{k,l-1} → b

which has been replaced by the "short cut" a → b. We repeat this process until the
resulting cycle no longer has non-concurrent segments. Lemma 5.5 ensures that the
process need only be repeated a finite number of times.

(ii) Consider a segment

    e_{i,1} → e_{i,2} → ... → e_{i,n_i}

that has more than two events. Because of the transitivity of "→" this segment can
be replaced by the shorter segment

    e_{i,1} → e_{i,n_i}.

□

This lemma expresses the following intuitive idea: The CBCAST implementation
breaks down if two different processors take mutually inconsistent actions without
knowing about each other's actions. The inconsistency of these actions is expressed as
a cycle in a run. The fact that the two processors do not know about each other's
actions is expressed by the corresponding events being concurrent.

Specifications that are acyclic have the property that certain types of events
which are part of ordering constraints can never occur concurrently in a plausible
run. We call such events mutually exclusive.

Definition 5.11

Two events a and b are mutually exclusive under S if

    ∀R: R plausible ⇒ ¬ a // b.

We prove a specification to be acyclic by showing that any cycle in the closure of a 
plausible run would contain mutually exclusive events in different segments of the 
cycle. This would force all cycles to have non-concurrent segments. However, this 
contradicts Lemma 5.6, which we just proved. 

Let us return to our token passing example. We will now prove that every 
plausible run of the token passing specification has an acyclic closure. 

Theorem 5.2

Consider the token passing example. Two successful PASS events of the form

    a = P_i(x):ok, and b = P_j(y):ok, for i ≠ j

are mutually exclusive.

This claim is very intuitive. If two such events were not mutually exclusive there
could be two processors holding the token at the same time, violating the token
passing specification.

Proof: Consider a run R with two concurrent pass events a = P_i(x):ok and
b = P_j(y):ok. We show that R cannot be plausible by induction on the number of
events in R.

Base case: R contains no events other than a and b. Assume R is plausible. Then
R[a] = a and R[b] = b must have legal linearizations H_a = (a) and H_b = (b)
respectively. Because processor 1 is the initial token holder, H_a = ( P_i(x):ok ) can
only be legal if i = 1. For the same reason H_b is only legal if j = 1. But i = j
contradicts the assumption that a and b are concurrent.

For the induction step consider R with more than two events. Assume R is plausible.
Then R[a] and R[b] have legal linearizations H_a and H_b respectively. Let R' =
R[a] ∩ R[b]. By induction hypothesis R[a], R[b], as well as R' do not have concurrent
pass events. Therefore we can define the following events:

    c  = P_l(z):ok    The last pass event in R'.
    a' = P_i(x):ok    The first pass event after c in R[a].
    b' = P_j(y):ok    The first pass event after c in R[b].

(where possibly, but not necessarily, a' = a and/or b' = b.) Note that a' // b' because
otherwise either a' or b' would be in R'. Then the histories H_a and H_b have the
form

    H_a = ... P_l(z):ok ... P_i(x):ok ... a
    H_b = ... P_l(z):ok ... P_j(y):ok ... b

with no pass events between c and a' in H_a and between c and b' in H_b. Then H_a
can only be legal if i = z, otherwise the operation P_i(x) should return an error code
eH. For the same reason H_b is only legal if j = z. Hence i = j, i.e., a' and b' are
events at the same processor. But that contradicts a' // b'. □

Not only pass operations but any two events that indicate that the caller is the 
current token holder are mutually exclusive: 

Theorem 5.3

Any two events of the following types are mutually exclusive:

    Q_i:T, R_i:eH, P_i(x):ok, or P_i(x):eR

The proof is very similar to the one for Theorem 5.2; it is carried out in Appendix A.

Now let us consider the ordering constraints occurring in the token passing
specification. These constraints are of one of the following three types (see Appendix A):

    (I)   Q_i:F ↦ P_j(i):ok

    (II)  R_i:eR ↦ P_j(i):ok

    (III) P_i(j):eR ↦ R_j:ok

Theorem 5.4

The token passing specification is acyclic.

Proof: Assume not. Let R be a plausible run that contains a cycle. By Lemma 5.6
we may assume that the cycle only has concurrent segments. Consider the ordering
constraint edges ("↦") in such a cycle. The cycle cannot contain more than one
edge of type (III), otherwise there would be two pass events in different segments
of the cycle, which is not possible since segments are concurrent and pass events
are mutually exclusive. For the same reason there cannot be more than one edge of
type (I) or (II) in the cycle. By Lemma 5.5 the cycle has at least two "↦" edges;
hence it must have exactly one edge of type (III) and one of type (I) or (II). Hence
the cycle is of the following form:

    C = P_j(i):ok → P_k(l):eR
        ↦ R_l:ok → e
        ↦ P_j(i):ok

where either e = Q_i:F or e = R_i:eR.

The first segment of the cycle consists of the two pass events a = P_j(i):ok and
b = P_k(l):eR. If R is plausible then R[b] has a legal linearization H_b. Because
a → b, a is in R[b] and therefore also in H_b. Hence H_b has the form

    H_b = ... P_j(i):ok ... P_k(l):eR

Notice that the return value eR of the last event (the pass operation failed because
processor l did not request the token) indicates that processor k is holding the token
at that time. Therefore H_b must contain a pass event c = P_i(x):ok between the two
events a and b in H_b; otherwise processor i would still be holding the token at the
end of H_b. From Theorem 5.3 we know that c cannot be concurrent with a or b;
hence

    a → c → b.

Now consider the event e in the second segment of the cycle. Events c and e cannot
be concurrent, because the operations were both invoked at processor i. If c → e
we have a → c → e; hence a → e. If e → c we have e → c → b; hence e → b. In
both cases the two segments of the cycle would not be concurrent, contradicting
Lemma 5.6. □

Let us summarize our techniques for deciding whether a specification is acyclic. 
Lemma 5.6 allows us to restrict our search for cycles to certain simple types of cycles 
with the following three properties: 

1. The cycle has at least two segments, i.e., it contains at least two "↦" edges.

2. Every segment has exactly two events. 

3. All segments are mutually concurrent. 

Because of properties 1 and 3, such a cycle must have concurrent events that occur 
in an ordering constraint. Therefore we are successful if we can show that events 
that are involved in ordering constraints are mutually exclusive, i.e., do not occur 
concurrently in a plausible run. 

5.3 Mixed Implementations 

The techniques we outlined in the previous sections are still useful if a specification is
cyclic or even if it is not strictly commutative. In the case where these techniques fail
to produce a correct asynchronous implementation for a specification S, it is often
not necessary to resort to an implementation that is based on atomic broadcasts
only. Instead, it is often possible to construct a mixed implementation in which most
events are propagated with CBCAST and ABCAST is used only for certain "critical"
events. For example, consider a service for managing shared data. If clients are
required to explicitly acquire locks before modifying the data, then only LOCK and
UNLOCK operations need to be globally ordered. Once a lock is granted the actual
updates may be propagated asynchronously [JB86]. The techniques developed in
the previous sections of this chapter allow us to identify what events are "critical":
events that do not commute and events that occur in cycles.

In this section we will outline how the results from this and the preceding
chapter can be generalized to apply to such mixed implementations. We modify our
definition of implementation by adding a parameter Λ defining the set of "critical"
operations that must be propagated by an atomic broadcast. That is, an
implementation is now a 9-tuple:

    (n, I, V, M, Q, q_0, Φ, Ψ, Λ), where Λ ⊆ I.

We also need a new ordering axiom that defines mixed implementation histories: 

Definition 5.12

Ordering axiom for mixed implementations:

(i) Causal ordering:

    inv_E(i,j) → inv_E(l,m) ⇒ ∀k: rcv_E((i,j),k) <_k rcv_E((l,m),k)

(iia) ∀ inv_E(i,j) ∈ Λ: (Global ordering)

    ∀ i',j': ∀ k,l:

    rcv_E((i,j),k) <_k rcv_E((i',j'),k) ⇔ rcv_E((i,j),l) <_l rcv_E((i',j'),l)

(iib) ∀ inv_E(i,j) ∈ I − Λ: (Immediate local delivery)

    ∀ i,j: ¬∃ a: inv_E(i,j) <_i a <_i rcv_E((i,j),i)

The axiom requires that all message delivery must be consistent with "→" (i),
that all message delivery must be globally ordered with respect to messages sent by
atomic broadcast (iia), and that messages sent by CBCAST are immediately delivered
locally. Notice that if Λ = ∅ (all events propagated by CBCAST) this definition is
equivalent to the CBCAST ordering axiom (Definition 4.5).
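Condition (iia) can be checked mechanically on recorded delivery orders. The following is a minimal sketch, not part of the formal model: each processor's deliveries are represented as a list of message identifiers, the set `atomic` plays the role of the messages whose invocations are in Λ, and the function name `globally_ordered` is hypothetical. It tests only the global ordering condition, not causal ordering or immediate local delivery.

```python
from typing import Dict, List, Set


def globally_ordered(logs: Dict[int, List[str]], atomic: Set[str]) -> bool:
    """Return True iff all processors agree on the relative delivery order of
    every atomic message and any other message that both of them deliver."""
    positions = {k: {m: i for i, m in enumerate(log)}
                 for k, log in logs.items()}
    for m in atomic:
        for pos_k in positions.values():
            for pos_l in positions.values():
                common = set(pos_k) & set(pos_l)
                if m not in common:
                    continue
                for other in common - {m}:
                    if (pos_k[m] < pos_k[other]) != (pos_l[m] < pos_l[other]):
                        return False
    return True


# Two CBCAST messages c1 and c2 may be delivered in different orders, but
# every processor must agree on where the atomic message "lock" falls.
agreeing = {1: ["c1", "c2", "lock"], 2: ["c2", "c1", "lock"]}
disagreeing = {1: ["c1", "lock"], 2: ["lock", "c1"]}
```

With `atomic = {"lock"}` the first pair of logs satisfies the condition and the second does not; with an empty `atomic` set the check is vacuous, corresponding to the pure CBCAST case.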

A mixed implementation is constructed the same way as a CBCAST implemen- 
tation, based on a linearization function. However, the correctness condition can 
be relaxed, because certain types of runs cannot occur in an execution of a mixed 
implementation. We formalize this below: 

Definition 5.13

A run R is called permissible under Λ if events with invocations in Λ are
globally ordered with respect to all other events:

    ∀e ∈ Λ × V: ∀e' ∈ R: ¬ e // e'

Lemma 5.7

Let Y be a mixed implementation and let E be a mixed execution history.
Then R_Y(E) is permissible.

Proof: Follows immediately from Definition 5.12 (iia). □

Because of this property of mixed implementations, the correctness condition that
we established for CBCAST implementations (Theorem 4.1) needs to be satisfied only
for permissible runs:

Theorem 5.5

If LIN is constructive and satisfies

    ∀ permissible runs R: R locally correct ⇒ LIN(R) ∈ S,

then Y_{S,LIN} is a correct mixed implementation of specification S.

Proof: We show that for every mixed execution history E, the history H =
LIN(R_Y(E)) is legal and satisfies the correctness and liveness conditions of
Definition 3.12.

Let E be a mixed execution history, and let R = R_Y(E). By Lemma 4.5, R is
locally correct. By Lemma 5.7, R is also permissible. Therefore, by assumption,
the history H = LIN(R) is legal. The rest of the proof is exactly the same as the
proof of Theorem 4.1 on page 64. □

This theorem also allows us to generalize the results of Section 5.2.2. 

Corollary 5.2 

If a commutative specification S satisfies the following condition

$\forall$ permissible $R$: $R$ plausible $\Rightarrow$ the closure of $R$ is acyclic,

then the mixed implementation under $LIN_S$ (Definition 5.6) is correct.

In the previous section we demonstrated how to prove that a specification is 
acyclic by showing that events that could form a cycle do not occur concurrently 
in a plausible run (i.e., are mutually exclusive). Events that could form a cycle but 
are not mutually exclusive cause this technique to fail. However, by Corollary 5.2, 
a mixed implementation will be correct if such events are propagated by atomic 
broadcast, ensuring that they do not occur concurrently in a permissible run. 

In a similar way it is possible to extend our results to mixed implementations 
in which events that are not commutative are propagated by atomic broadcast. 
5.4 Summary

In this chapter we saw that, in general, the question of whether a linearization
operator exists for a given specification is undecidable. Hence there are no general
methods for finding such operators. Therefore, we considered a restricted class
of specifications, which we call commutative. We showed how to exploit knowledge
about the commutativity of events to construct linearization operators for
specifications in this class. These methods are useful for developing efficient
asynchronous implementations for a broad range of practical problems.



Chapter 6 
Failures 


In our treatment so far we have assumed a distributed system that is perfectly 
reliable. However, one of the main uses of broadcast protocols is in the design of 
fault-tolerant programs. In this chapter we will address the problems that arise if
we take processor failures into account. 

6.1 Integrating Failures into the Model 

In Chapters 3 and 4 we showed how to take a formal specification of a centralized
service and use broadcast protocols to construct a distributed implementation of
this service. Now we want to make the distributed service fault tolerant. What we
mean by “fault tolerant” is that even if some processors fail, the behavior of the
distributed service should be indistinguishable from a perfectly reliable centralized
server. As we will see in this section, we achieve this goal simply by replacing the
broadcast protocols used in the implementations constructed in Chapters 3 and 4
by reliable versions of the same protocols. In other words, if the broadcast protocol
used in an implementation provides atomic message delivery, the implementation
will automatically be fault tolerant.

To be more precise, we must add failures to our execution model. An execution
history may now contain failure events in addition to invocation and receive
events. We modify our definition of execution histories (Definitions 3.4 and 3.7)
accordingly:

Definition 6.1 

An unreliable execution history $E = (E_1, \ldots, E_n)$ is a collection of ordered sets
of invocation, receive, and failure events,

$E \in [(I \cup \mathbb{N}^2 \cup \{FAIL\})^*]^n$,

satisfying the following conditions:

(i) Reliable message delivery:

$\forall\, inv_E(i,j): \forall k: \exists$ unique receive event $(i,j) \in E_k$

(ii) Sequential invocation:

$\forall i,j: rcv_E((i,j),i) <_i inv_E(i,j+1)$

(iii) Monotonicity of time:

$\nexists\, e_1, \ldots, e_m \in E: e_1 \xrightarrow{E} e_2 \xrightarrow{E} \cdots \xrightarrow{E} e_m \xrightarrow{E} e_1$

(iv) No invocation events after a failure:

$\forall k: FAIL \in E_k \Rightarrow \nexists$ invocation event $a \in E_k: a >_k FAIL$

Conditions (i)-(iii) are exactly the same as in Definitions 3.4 and 3.7. For notational
convenience we pretend that a processor that has crashed still receives broadcasts. 
Hence a failure is simply an event after which a processor stops sending any new 
messages (condition (iv)). Figure 6.1 illustrates such an execution history. Notice 
that this model describes an implementation based on reliable broadcast protocols, 
because condition (i) in Definition 6.1 ensures atomic message delivery.

E_1 =  (2,1)  c      (3,1)  (1,1)  (3,2)
E_2 =  a      FAIL   (2,1)  (3,1)  (1,1)  (3,2)
E_3 =  (2,1)  b      (1,1)  (3,1)  d      (3,2)
E_4 =  (2,1)  (3,1)  FAIL   (1,1)  (3,2)

Figure 6.1: An execution history with failure events
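Conditions (i) and (iv) of Definition 6.1 can be checked mechanically on a concrete history. The following is a minimal sketch under an assumed encoding that is not from the text: each E_k is a Python list in which invocation events are strings, receive events are `(i, j)` tuples, and a failure is the string `"FAIL"`. The sample history is made up in the spirit of Figure 6.1, and `reliable_delivery` only approximates condition (i) by requiring that every message seen anywhere is received exactly once everywhere.

```python
FAIL = "FAIL"

def is_invocation(ev):
    """Invocation events are encoded as strings other than FAIL."""
    return isinstance(ev, str) and ev != FAIL

def is_receive(ev):
    """Receive events are encoded as (i, j) tuples."""
    return isinstance(ev, tuple)

def no_invocation_after_failure(history):
    """Condition (iv): in every E_k, no invocation event follows FAIL."""
    for seq in history:
        if FAIL in seq:
            tail = seq[seq.index(FAIL) + 1:]
            if any(is_invocation(ev) for ev in tail):
                return False
    return True

def reliable_delivery(history):
    """Approximation of condition (i): every message is received exactly once everywhere."""
    messages = {ev for seq in history for ev in seq if is_receive(ev)}
    return all(seq.count(m) == 1 for seq in history for m in messages)

# Hypothetical four-processor history; processors 2 and 4 fail but keep "receiving".
E = [
    [(2, 1), "c", (3, 1), (1, 1), (3, 2)],
    ["a", FAIL, (2, 1), (3, 1), (1, 1), (3, 2)],
    [(2, 1), "b", (1, 1), (3, 1), "d", (3, 2)],
    [(2, 1), (3, 1), FAIL, (1, 1), (3, 2)],
]
print(no_invocation_after_failure(E), reliable_delivery(E))
```

Treating a crashed processor as one that still receives broadcasts keeps the delivery check uniform across all processors, which is the notational convenience described above.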

The definition of an implementation as an 8-tuple $(n, I, V, M, Q, q_0, \phi, \psi)$ remains
unchanged, but we have to specify the effect of failure events. We do so by defining
the state of a processor after a failure to be undefined ($\bot$), that is, we modify the
definition of $stat_E[i,j]$ as follows:


Definition 6.2 

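$stat_E[i,j] = \begin{cases}
q_0 & \text{if } j = 0 \\
\bot & \text{if } E[i,j] = FAIL \\
\bot & \text{if } stat_E[i,j-1] = \bot \\
\phi_i(stat_E[i,j-1], a) & \text{if } E[i,j] = a \text{ is an invocation event} \\
\psi_i(stat_E[i,j-1], m) & \text{if } E[i,j] = (k,l) \text{ is a receive event, where} \\
 & m = \phi^M_k(stat_E[k, inum_E(k, inv_E(k,l))])
\end{cases}$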
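The effect of Definition 6.2 on a single processor can be sketched as a fold over its event sequence. This is an illustration under simplifying assumptions rather than the definition itself: receive events are assumed to carry the message directly instead of computing it from the sender's state, and the names `state_trace`, `on_invoke`, and `on_receive` are hypothetical.

```python
BOTTOM = None  # stands for the undefined state after a failure

def state_trace(events, q0, on_invoke, on_receive):
    """Return the list of states of one processor, one entry per event prefix."""
    trace = [q0]
    for kind, payload in events:
        prev = trace[-1]
        if kind == "FAIL" or prev is BOTTOM:
            trace.append(BOTTOM)            # a failed processor stays undefined
        elif kind == "inv":
            trace.append(on_invoke(prev, payload))
        else:                               # kind == "rcv": payload is the message m
            trace.append(on_receive(prev, payload))
    return trace

# Toy counter replica: invocations leave the state alone, messages add to it.
trace = state_trace(
    [("inv", "incr"), ("rcv", 2), ("FAIL", None), ("rcv", 5)],
    q0=0,
    on_invoke=lambda q, a: q,
    on_receive=lambda q, m: q + m,
)
print(trace)
```

Once the failure event is reached, every later state is undefined, even though the processor is still modeled as receiving broadcasts.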


The definitions of $msg_E(i,j)$, $val_E(i,j)$, $event_E(i,j)$, and $H[E,i]$ remain the same as
before (Definition 3.11). In an unreliable system we define an implementation to be
correct if no operational site can distinguish its behavior from that of a perfectly
reliable centralized service:

Definition 6.3 

Y is a correct XBCAST-implementation of specification $\mathcal{S} = (n, I, V, S)$ iff:

$\forall$ XBCAST execution history $E$: $\exists H \in S$:

Correctness: $\forall i$: if $FAIL \notin E_i$ then $H|_i = H[E,i]$

Liveness: $\forall i,j,k,l$:

$rcv_E((i,j),k) <_k inv_E(k,l) \Rightarrow event_E(i,j) <_H event_E(k,l)$
The fact that in our execution model message delivery is reliable ensures that
processors that do not fail are not affected by the failure of other processors. This
is expressed in the following lemma:

Lemma 6.1 

Let E be an execution history with failure events, and let E' be identical to E
except that all failure events are deleted from it. Then

(i) E' is a well-formed execution history.

(ii) $\forall i$: if $E_i$ does not contain a failure event then $H[E,i] = H[E',i]$.

Proof: (i) As an unreliable execution history, E satisfies conditions (i)-(iii) in
Definition 6.1. Condition (i) ensures that E' is an execution sequence according
to Definition 3.4; conditions (ii) and (iii) ensure that this execution sequence is a
well-formed execution history (Definition 3.7).

(ii) By construction, all invocations in H[E,i] and H[E',i] are identical. Hence we
only have to show that all return values are also the same. Consider the formal
event $e = a{:}v = event_E(i,j) \in H[E,i]$. Then $a = inv_E(i,j)$ and $v = val_E(i,j)$.
Let $b = E[i, inum_E((i,j),i) - 1]$ be the corresponding receive event. Recall that
according to Lemma 4.1, $val_E(i,j)$ depends only on events that precede b under
$\xrightarrow{E}$. Hence we are done if we can show that $E[b] = E'[b]$. Assume that $E[b] \neq E'[b]$.
This is only possible if E[b] contains a failure event f. Because a processor does
not send any messages after it fails, the only events related to a failure event f are
receive events after f at the same processor. Hence $f \in E[b]$ implies that $f \in E_i$,
contradicting our assumption that $E_i$ does not contain a failure event. □
Theorem 6.1

Every correct implementation of a specification S is also fault-tolerant.

Proof: Let Y be a correct implementation of S and let E be an unreliable execution
history. Let E' be E with failure events deleted. Because Y is correct, there exists a
history $H \in S$ that satisfies the correctness and liveness conditions with respect to
E'. By Lemma 6.1, $H[E,i] = H[E',i]$ for all $E_i$ with no failure events. Therefore H
will also satisfy the correctness and liveness conditions with respect to E as stated
in Definition 6.3. □

6.2 Client Failures 

A processor failure not only affects a component of a distributed service, but also 
the client running at that site. The designer of a distributed service may want to 
explicitly specify a particular action to be taken if a client fails. In the token passing 
service, for example, it is desirable that the token is not lost if its current holder 
fails. Hence, we would want to specify the behavior of the token passing service in 
such a way that the token is automatically transferred to some other client in the 
case of a failure. 

This problem can be solved within our formalism by treating a client failure like
any other operation invoked by a client. In other words, the specification is designed 
as if a client invoked a special operation “CRASH” just before its processor fails. If 
the distributed system provides a means of detecting failures, such a specification 
can be implemented in the same way as specifications that do not contain client 
failures. For example, the ISIS system provides a failure detection and notification 
mechanism that makes the failure of a processor look as if the processor sent out a
broadcast announcing its death just before the failure [BJ87b,BJSS86].


6.3 Summary 


In this chapter we showed that reliable broadcast protocols can be used to construct 
a fault-tolerant distributed service. This approach is very similar to the method of
replicated state machines described by Schneider in [Sch86]. 



Chapter 7 
Conclusion 

7.1 Summary and Discussion 

We considered a variety of reliable broadcast protocols that differ in the form of
message ordering they provide: atomic broadcast (ABCAST), causal broadcast
(CBCAST), FIFO broadcast (FBCAST), and unordered broadcast (BCAST). The stronger
the ordering property of the protocol, the more costly its implementation. There is a
fundamental difference between atomic broadcast and the other forms of broadcast.
An atomic broadcast protocol requires at least two phases of message exchange,
whereas CBCAST, FBCAST, and BCAST can be implemented as one-phase protocols.
Furthermore, in an unreliable system in which processors may experience failures,
ABCAST can only be implemented if failures are detectable or if an upper bound on
message delays is known. CBCAST, FBCAST, and BCAST, on the other hand, can be
implemented reliably in a completely asynchronous system.

Our results from Chapter 4 show that this fundamental difference is also reflected 
in the classes of problems that can be solved with a particular broadcast protocol. 
We showed that the class of all formal specifications separates into two distinct
subclasses, $S_{async}$ and $S - S_{async}$, which correspond to specifications that have
an implementation based on CBCAST, FBCAST, or BCAST, and specifications that
require the global ordering that ABCAST provides.

For specifications in $S_{async}$, an implementation can be expressed in a canonical
form based on a linearization function for that specification. Although the
existence of such a function is in general undecidable, it is possible to analyze
commutativity and dependencies between events to find linearization functions for a
subclass of $S_{async}$. The methodology introduced in Chapter 5 allows identification
of conflicting events and establishes conditions that allow the construction of an
asynchronous implementation. Specifications for which this method is successful
could be characterized as “self-synchronizing”, that is, the specification itself
prevents certain conflicting events from occurring concurrently. Even if our techniques
fail to yield a completely asynchronous implementation they are still useful for con- 
structing a mixed implementation, as they identify a subset of events that need to 
be propagated by atomic broadcast. 

7.2 Future work 

It should be possible to extend the results from Chapter 5. Notice that the condi- 
tions presented in that chapter are sufficient but not necessary for the existence of 
an asynchronous implementation. This naturally raises the question of whether the 
methodology can be generalized to cover a larger set of specifications. For example, 
there are non-commutative specifications that have asynchronous implementations. 
Examples are specifications in which only operations invoked by one particular pro- 
cessor are sensitive to the order of the events. For example, one can modify the 
token passing specification to require token requests to be serviced in FIFO order. 
Such a specification is not commutative, because token requests no longer commute.
However, because only the current token holder decides which processor receives the
token next, it is not necessary that token requests are globally ordered.

Another interesting problem is to generalize our formalism to allow implementations
that exhibit “temporary inconsistencies”. In Chapter 4 we showed that
the problem of implementing a shared counter does not have an asynchronous
solution. However, for certain types of implementations it might be acceptable if a
read operation returns the sum of only a subset of previous increments, as long as
every increment is eventually reflected in all future reads. The formalism presented
in Chapter 3 allows us to relax the shared counter specification to allow reads to
return partial sums. However, because we need specifications to be prefix-closed, we
cannot express the requirement that increments are not ignored forever. One way
of solving this problem might be to define specifications as sets of partially ordered
sets of events (i.e., runs) rather than sets of histories. One could then specify a
shared counter in such a way that a read operation is allowed to ignore an increment
only if it is concurrent to the read. The drawback of this approach is that
specifications no longer have the intuitive meaning of ensuring that the distributed
program behaves like a centralized server.



Appendix A 

An Example: Token Passing 

A.1 Formal Specification

We want to implement a distributed token passing algorithm. The client interface
consists of the following three operations:

• QUERY(): BOOLEAN

- returns TRUE if the caller is the current token holder.

• PASS(x: ClientId): ReturnCode

- passes the token from the current token holder to client x.

This operation returns one of three values: OK, ErrorHolder (the caller is
not the current token holder), or ErrorRequest (client x did not request
the token).

• REQUEST(): ReturnCode

- requests the token.

This operation returns one of three values: OK, ErrorHolder (the caller
is already holding the token), or ErrorRequest (the caller has already
requested the token).

We use the following abbreviated notation for operations and return values: 

Q    QUERY
P    PASS
R    REQUEST
T    TRUE
F    FALSE
eH   ErrorHolder
eR   ErrorRequest


Given a formal history H, we identify the current token holder, CurHold(H), to be
the client that the token was last passed to, where client 1 is the initial holder of the
token:

$CurHold(H) = \begin{cases}
1 & \text{if } H \text{ does not contain any successful PASS operations,} \\
x & \text{if the last successful PASS event in } H \text{ has the form } P_i(x){:}ok \text{, for some } i.
\end{cases}$

We can further define a predicate that tells us whether there is a pending token
request by a particular client:

$PendReq(H,x) = \begin{cases}
TRUE & \text{if } R_x{:}ok \in H \text{ and } H \text{ does not contain an event } P_i(x){:}ok \text{ after this request,} \\
FALSE & \text{otherwise.}
\end{cases}$

A formal specification S for our token passing example is given by the following
recursive definition:
1. $\emptyset \in S$

2. $\forall H \in S$: let $x = CurHold(H)$:

(i) $\forall y \neq x$: $H + Q_y{:}F \in S$,
    and $H + Q_x{:}T \in S$.

(ii) $\forall j$: if $PendReq(H,j)$ then $H + P_x(j){:}ok \in S$.
     if $\neg PendReq(H,j)$ then $H + P_x(j){:}eR \in S$.

(iii) $\forall y \neq x$: $\forall j$: $H + P_y(j){:}eH \in S$.

(iv) $\forall y \neq x$: if $\neg PendReq(H,y)$ then $H + R_y{:}ok \in S$.
     if $PendReq(H,y)$ then $H + R_y{:}eR \in S$.

(v) $H + R_x{:}eH \in S$.

3. S is the smallest set satisfying the above.
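The recursive definition can be read as a legality check on histories. The sketch below is one possible rendering under assumptions that are not from the text: an event is a hypothetical 4-tuple `(op, caller, arg, ret)` with `op` in `"Q"`, `"P"`, `"R"`, and a request counts as pending from its most recent successful REQUEST until the token is passed to that client.

```python
def cur_hold(history):
    """CurHold(H): the target of the last successful PASS, or client 1 initially."""
    holder = 1
    for op, caller, arg, ret in history:
        if op == "P" and ret == "ok":
            holder = arg
    return holder

def pend_req(history, x):
    """PendReq(H, x): x requested the token and has not received it since."""
    pending = False
    for op, caller, arg, ret in history:
        if op == "R" and caller == x and ret == "ok":
            pending = True
        elif op == "P" and arg == x and ret == "ok":
            pending = False
    return pending

def expected_return(history, op, caller, arg=None):
    """The unique return value that rules 2(i)-(v) allow after history H."""
    holder = cur_hold(history)
    if op == "Q":
        return "T" if caller == holder else "F"
    if op == "P":
        if caller != holder:
            return "eH"
        return "ok" if pend_req(history, arg) else "eR"
    if op == "R":
        if caller == holder:
            return "eH"
        return "eR" if pend_req(history, caller) else "ok"
    raise ValueError(f"unknown operation {op!r}")

def is_legal(history):
    """H is in S iff every event carries the return value allowed by its prefix."""
    prefix = []
    for event in history:
        op, caller, arg, ret = event
        if ret != expected_return(prefix, op, caller, arg):
            return False
        prefix.append(event)
    return True

good = [("R", 2, None, "ok"), ("P", 1, 2, "ok"), ("Q", 2, None, "T")]
bad = [("P", 1, 2, "ok")]  # client 2 never requested the token
print(is_legal(good), is_legal(bad))
```

Because each operation has exactly one permitted return value after any legal prefix, checking a history reduces to replaying it event by event.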

The specification S says that QUERY should always return TRUE to the current
token holder, and that only the current holder is allowed to pass the token to any 
other client. This specification describes an idealized token passing system, in the 
sense that passing a token is supposed to be an instantaneous event: A PASS oper- 
ation takes effect immediately, because any operation following a PASS is required 
to reflect the new token holder. Fortunately our definition of implementation cor- 
rectness only requires that the behavior of the system is indistinguishable from this 
idealized behavior. To illustrate this point, consider a system with only two pro- 
cessors. A PASS operation may be implemented by simply sending a message from 
the previous to the new token holder. An external observer may see the following 
history: 


$H = (P_1(2), Q_2{:}F, Q_2{:}T)$
Obviously the pass message was delayed a little, so that only the second QUERY
operation reported client 2 as the current token holder. Although $H \notin S$, we still
consider the implementation correct, because, to the clients, H is indistinguishable
from the legal history

$H' = (Q_2{:}F, P_1(2), Q_2{:}T)$.
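One way to make "indistinguishable to the clients" concrete is to compare the per-client projections of the two histories. The sketch below assumes a hypothetical encoding of events as `(op, caller, detail)` tuples and takes both queries to be issued by client 2; it illustrates the idea and does not reproduce the formal correctness condition.

```python
def project(history, client):
    """The subsequence of events invoked by one client, in order."""
    return [event for event in history if event[1] == client]

def indistinguishable(h1, h2, clients):
    """True if every client sees the same sequence of its own events in both histories."""
    return all(project(h1, c) == project(h2, c) for c in clients)

observed = [("P", 1, 2), ("Q", 2, "F"), ("Q", 2, "T")]   # what an external observer sees
legal = [("Q", 2, "F"), ("P", 1, 2), ("Q", 2, "T")]      # a reordering with identical projections
print(indistinguishable(observed, legal, clients=[1, 2]))
```

The two histories differ as global sequences, yet neither client can tell them apart from its own events alone.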

A.2 Commutativity and Ordering Constraints

Next we apply our theory from Chapter 5 to show that the token passing example
indeed has a CBCAST implementation. We start by computing a table of dependencies
(Table A.1) between events. There are two pairs of events which are completely
interchangeable and need not be considered separately:

$R_i{:}eH \equiv Q_i{:}T$, and
$P_i(x){:}eH \equiv Q_i{:}F$.

For this reason these events share the same row and column in Table A.1.

The table allows us to verify that the token passing specification is commutative.
There are two types of update events:

$R_i{:}ok$ and $P_i(j){:}ok$.

According to Definition 5.3 we have to show that

$\forall H$: $\forall a, b$ update events at different processors:

$Ha \in S \wedge Hb \in S \Rightarrow Hab \in S \wedge Hba \in S \wedge Hab \equiv Hba$.

Table A.1 shows that for any two such update events a and b, either the two events
commute (o entry in the table), or there is no H such that Ha and Hb are both
legal (x in the table).
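For a small number of clients the commutativity condition can also be checked by exhaustive search. The sketch below makes several assumptions that are not from the text: it fixes three clients, summarizes a history by the abstract state `(holder, pending requests)`, and approximates the equivalence of Hab and Hba by equality of the resulting abstract states. It is a sanity check on the table, not a substitute for the proof.

```python
from itertools import product

N = 3  # number of clients in the bounded search (an assumption)

def step(state, event):
    """Apply an update event to (holder, pending); return None if it is not legal."""
    holder, pending = state
    kind, caller, target = event
    if kind == "R":                       # R_caller:ok
        if caller != holder and caller not in pending:
            return (holder, pending | {caller})
    elif kind == "P":                     # P_caller(target):ok
        if caller == holder and target in pending:
            return (target, pending - {target})
    return None

def update_events():
    clients = range(1, N + 1)
    requests = [("R", c, None) for c in clients]
    passes = [("P", c, t) for c, t in product(clients, clients)]
    return requests + passes

def reachable_states():
    """All abstract states reachable from the initial one (client 1 holds the token)."""
    start = (1, frozenset())
    seen, frontier = {start}, [start]
    while frontier:
        state = frontier.pop()
        for event in update_events():
            nxt = step(state, event)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def is_commutative():
    """Check: if Ha and Hb are legal (different callers), then Hab and Hba are legal and agree."""
    for state in reachable_states():
        for a, b in product(update_events(), repeat=2):
            if a[1] == b[1]:
                continue                  # only events at different processors
            sa, sb = step(state, a), step(state, b)
            if sa is None or sb is None:
                continue                  # incompatible pair: nothing to check
            if step(sa, b) is None or step(sa, b) != step(sb, a):
                return False
    return True

print(is_commutative())
```

Pairs that are never simultaneously legal are skipped, which corresponds to the x entries of the table; the remaining pairs must commute.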
Table A.1: Dependencies between events in the token passing specification

                  | Q_j:T  | Q_j:F     | R_j:eR | P_j(y):eR | R_j:ok     | P_j(y):ok
                  | R_j:eH | P_j(y):eH |        |           |            |
------------------+--------+-----------+--------+-----------+------------+-----------
Q_i:T, R_i:eH     | x      | o         | o      | x         | o          | x
Q_i:F, P_i(x):eH  | o      | o         | o      | o         | o          | ↦ if y = i
                  |        |           |        |           |            | o if y ≠ i
R_i:eR            | o      | o         | o      | o         | o          | ↦ if y = i
                  |        |           |        |           |            | o if y ≠ i
P_i(x):eR         | x      | o         |        | x         | ↦ if x = j | x
                  |        |           |        |           | o if x ≠ j |
R_i:ok            | o      | o         | o      | o         | o          | x if y = i
                  |        |           |        |           |            | o if y ≠ i
P_i(x):ok         | x      | o         | o      | x         | x if x = j | x
                  |        |           |        |           | o if x ≠ j |

o  The events commute, i.e., Hab ≡ Hba, for all H.
x  The events are incompatible, i.e., there does not exist any history
   H such that Ha and Hb would both be legal.
↦  There is an ordering constraint between the two events (Definition 5.4).
A.3 Mutually Exclusive Events

Next, we need to show that every plausible run has an acyclic closure. We exploit
the fact that certain types of events are mutually exclusive. These are all events
that indicate that their caller is currently holding the token.

Definition A.1

$B_i = \{Q_i{:}T, R_i{:}eH\} \cup \{P_i(x){:}ok \mid \text{for all } x\} \cup \{P_i(x){:}eR \mid \text{for all } x\}$

Lemma A.1

The set $B_i$ contains all events that indicate that processor i is holding the token
when the event occurs, i.e.,

$a \in B_i \wedge Ha \in S \Rightarrow CurHold(H) = i$.

Proof: Follows immediately from the token passing specification and the definition
of $B_i$. □

Lemma A.2 

Let $B = \bigcup_i B_i$.

All events in B are mutually exclusive. 

The proof of a restricted form of this lemma was already presented in Section 5.2.3. 

Proof: Consider a plausible run R with two events $a \in B_i$ and $b \in B_j$. We
have to show that a and b cannot be concurrent. We do this by induction on the
number of events in R.

Base case: R contains no events other than a and b. Because R is plausible, $R[a] = a$
and $R[b] = b$ have legal linearizations $H_a = (a)$ and $H_b = (b)$ respectively. By
Lemma A.1, $H_a = (a) \in S$ and $a \in B_i$ imply that $i = CurHold(\emptyset) = 1$. For the
same reason $j = CurHold(\emptyset) = 1$. Therefore $a, b \in B_1$. Hence a and b cannot be
concurrent, because all events in $B_1$ correspond to operations invoked by the same
processor (processor 1).

For the induction step consider R with more than two events. Because R is plausible,
$R[a]$ and $R[b]$ have legal linearizations $H_a$ and $H_b$ respectively. Let $R' = R[a] \cap R[b]$.
By induction hypothesis $R[a]$, $R[b]$, as well as $R'$ do not contain any concurrent
events from the set B. Therefore we can define the following events:

$c = P_i(z){:}ok$    The last event in $R' \cap B$.

$a' = P_i(x){:}ok$   The first event after c in $R[a] \cap B$.

$b' = P_j(y){:}ok$   The first event after c in $R[b] \cap B$.

(where possibly, but not necessarily, $a' = a$ or $b' = b$). Note that $a' \parallel b'$, because
otherwise either $a'$ or $b'$ would be in $R'$. Then the histories $H_a$ and $H_b$ have the
form

$H_a = \ldots P_i(z){:}ok \ldots P_i(x){:}ok \ldots a$

$H_b = \ldots P_i(z){:}ok \ldots P_j(y){:}ok \ldots b$

with no pass events between c and $a'$ in $H_a$ and between c and $b'$ in $H_b$. Then $H_a$
can only be legal if $i = z$, otherwise the operation $P_i(x)$ should return an error code
eH. For the same reason $H_b$ is only legal if $j = z$. Hence $i = j$, i.e., $a'$ and $b'$ are
events at the same processor. But that contradicts $a' \parallel b'$. □
A.4 Acyclicity

We already proved the token passing specification to be acyclic in Section 5.2.3 of 
Chapter 5. For the sake of completeness we repeat this proof here. 

Theorem A.l 

The token passing specification is acyclic. 

Proof: Assume not. Let R be a plausible run that contains a cycle. By Lemma 5.6
we may assume that the cycle only has concurrent segments. Consider the ordering
constraint edges (‘↦’) in such a cycle. The cycle cannot contain more than one
edge of type (III), otherwise there would be two pass events in different segments
of the cycle, which is not possible since segments are concurrent and pass events
are mutually exclusive. For the same reason there cannot be more than one edge of
type (I) or (II) in the cycle. By Lemma 5.5 the cycle has at least two ‘↦’ edges;
hence it must have exactly one edge of type (III) and one of type (I) or (II). Hence
the cycle is of the following form:

$C = P_j(i){:}ok \rightarrow P_k(l){:}eR \mapsto R_l{:}ok \rightarrow e \mapsto P_j(i){:}ok$

where either $e = Q_i{:}F$ or $e = R_i{:}eR$.

The first segment of the cycle consists of the two pass events $a = P_j(i){:}ok$ and
$b = P_k(l){:}eR$. If R is plausible then $R[b]$ has a legal linearization $H_b$. Because $a \rightarrow b$,
a is in $R[b]$ and therefore also in $H_b$. Hence $H_b$ has the form

$H_b = \ldots P_j(i){:}ok \ldots P_k(l){:}eR$
Notice that the return value eR of the last event (the pass operation failed because
processor l did not request the token) indicates that processor k is holding the token
at that time. Therefore $H_b$ must contain a pass event $c = P_i(x){:}ok$ between the two
events a and b in $H_b$, otherwise processor i would still be holding the token at the
end of $H_b$. From Theorem 5.3 we know that c cannot be concurrent with a or b;
hence

$a \rightarrow c \rightarrow b$.

Now consider the event e in the second segment of the cycle. Events c and e cannot
be concurrent, because the operations were both invoked at processor i. If $c \rightarrow e$
we have $a \rightarrow c \rightarrow e$; hence $a \rightarrow e$. If $e \rightarrow c$ we have $e \rightarrow c \rightarrow b$; hence $e \rightarrow b$. In
both cases the two segments of the cycle would not be concurrent, contradicting
Lemma 5.6. □



Appendix B 

Invocation-Completion Model 


In our formal specifications we consider the execution of an operation to be one 
single event. This does not allow us to model operations that explicitly wait for 
another client to take some action (i.e., invoke another operation). Such wait- 
semantics operations can be modeled if we treat the invocation and the completion 
of an operation as two separate events. 

We argue that it is not necessary to do this: A specification with separate invo- 
cation and completion events can be transformed into an equivalent one-operation- 
one-event specification in which wait-semantics operations are implemented by a 
busy wait. This works as follows: 

Say specification S has an operation A with separate events for the invocation
and completion of A (INVOKE_A(...) and COMPLETE_A:v). We transform S into S'
by splitting A into two parts:

START_A() and QUERY_A().

We then make the following two modifications to S:

1. Replace all invocation events INVOKE_A(...) by an event START_A(...):NIL.
Replace all completion events COMPLETE_A:v by an event QUERY_A():v.

2. Add additional histories to S that are obtained by inserting extra QUERY_A
events into existing histories: Insert an event QUERY_A():PENDING anywhere
between START_A(...):NIL and QUERY_A():v; insert an event QUERY_A():DONE
anywhere after QUERY_A():v but before the next START_A(...):NIL.

Then the effect of a client invoking A in S is the same as the client invoking START_A
in S' and then doing a busy wait

while QUERY_A() = PENDING do nothing.

With this transformation, an implementation that satisfies the modified specification
S' will be equivalent to one that satisfies the original specification S.
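The transformation and the busy wait can be sketched in a few lines. Everything below is illustrative and rests on assumptions that are not from the text: history events are encoded as tuples, `ToyService` is a hypothetical stand-in for an implementation of S' that reports PENDING for a fixed number of polls, and only step 1 of the transformation (the renaming of events) is shown.

```python
def transform(history):
    """Step 1: rename INVOKE_A/COMPLETE_A events into START_A/QUERY_A events."""
    result = []
    for event in history:
        if event[0] == "INVOKE_A":
            result.append(("START_A", event[1], "NIL"))
        elif event[0] == "COMPLETE_A":
            result.append(("QUERY_A", (), event[1]))
        else:
            result.append(event)          # other operations are left unchanged
    return result

class ToyService:
    """Hypothetical S' service: QUERY_A answers PENDING a few times, then the value."""
    def __init__(self, polls_until_done, value):
        self._remaining = polls_until_done
        self._value = value

    def start_a(self, *args):
        return "NIL"

    def query_a(self):
        if self._remaining > 0:
            self._remaining -= 1
            return "PENDING"
        return self._value

def invoke_a(service, *args):
    """Emulate the wait-semantics operation A by START_A followed by a busy wait."""
    service.start_a(*args)
    while True:
        result = service.query_a()
        if result != "PENDING":
            return result

print(transform([("INVOKE_A", (1,)), ("COMPLETE_A", 7)]))
print(invoke_a(ToyService(polls_until_done=3, value=42)))
```

The client-side wrapper `invoke_a` hides the polling, so callers see the same blocking behavior that the original operation A would have provided.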